UC Berkeley and Google Introduce IGN, a New Generative AI Model Capable of Producing Realistic Images in a Single Step
A new paradigm for generative AI models is emerging. Researchers from UC Berkeley and Google have proposed the Idempotent Generative Network (IGN), a model capable of generating images in a single step. Existing generative models such as GANs, diffusion models, and consistency models produce images by mapping inputs to outputs that match a target data distribution, and they typically must learn from large numbers of real images so that the generated pictures retain realistic features.
IGN can generate realistic images from a variety of inputs, such as random noise or simple graphics, in a single step, without requiring multiple iterations. The model is designed to function as a 'global projector,' capable of mapping any input data onto the target data distribution.
Paper link: https://arxiv.org/abs/2311.01462
Interestingly, a scene from 'Seinfeld' served as inspiration for the researchers: Jerry's riff that you "can't overdry" laundry, because once something is dry, drying it again changes nothing. The scene neatly illustrates the concept of an 'idempotent operator,' an operation whose result does not change when it is applied again to its own output.
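Formally, an operator f is idempotent when f(f(z)) = f(z) for every input z, and this is precisely the property IGN trains its network to satisfy. A minimal illustration in Python (generic examples, not from the paper):

```python
# Idempotence in miniature: a second application changes nothing.
clamp = lambda x: max(0.0, min(1.0, x))   # projects any number into [0, 1]
assert clamp(clamp(3.7)) == clamp(3.7)    # both applications give 1.0
assert sorted(sorted([3, 1, 2])) == sorted([3, 1, 2])
```

Like the clamp above, an idempotent generator acts as a projection: inputs already on the target manifold are left alone, while everything else is pulled onto it.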
IGN differs from GANs and diffusion models in two important ways. Unlike a GAN, IGN needs no separate generator and discriminator: it is a "self-adversarial" model that performs generation and discrimination with a single network. And unlike diffusion models, which execute many incremental steps, IGN attempts to map its input onto the data distribution in a single step. Its source and target distributions also reside in the same space, meaning inputs and outputs have the same dimensions.
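The paper turns this into three objectives on the single network f: real images should map to themselves, outputs generated from noise should be fixed points of f, and the set of fixed points should stay as tight as possible around the real data. Below is a minimal PyTorch-style sketch of these losses; the L1 distance, the loss weights, and the per-step frozen copy are illustrative assumptions rather than the authors' exact implementation.

```python
import copy
import torch.nn.functional as F

def ign_losses(f, x, z, lam_rec=1.0, lam_idem=1.0, lam_tight=0.1):
    """Sketch of an IGN-style training objective for a single network f.
    x: batch of real images; z: noise batch with the same shape as x."""
    f_frozen = copy.deepcopy(f).requires_grad_(False)  # stop-gradient copy (wasteful but simple)

    # Reconstruction: real images should be fixed points of f.
    loss_rec = F.l1_loss(f(x), x)

    # Idempotence: f(z) should also be a fixed point; the gradient flows
    # through the inner application (the outer copy is frozen).
    fz = f(z)
    loss_idem = F.l1_loss(f_frozen(fz), fz)

    # Tightness: keep arbitrary points from becoming fixed points; here the
    # gradient flows through the outer application (the inner copy is frozen),
    # and the sign is negated so the distance is pushed apart.
    fz_bar = f_frozen(z)
    loss_tight = -F.l1_loss(f(fz_bar), fz_bar)

    return lam_rec * loss_rec + lam_idem * loss_idem + lam_tight * loss_tight
```

In this view, f plays both adversarial roles at once: the idempotence term trains it as a generator (its own outputs should be fixed points), while the tightness term trains it as a discriminator (other inputs should not be).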
The researchers acknowledge that, at this stage, IGN's generation results cannot compete with the most advanced models. The experiments used smaller models and lower-resolution datasets, with the primary focus on exploring a simplified approach. Fundamental generative modeling techniques such as GANs and diffusion models likewise took considerable time to reach mature, scalable performance. IGN was evaluated on MNIST (a grayscale handwritten-digit dataset) and CelebA (a face-image dataset), at resolutions of 28×28 and 64×64, respectively.
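For reference, the two datasets at the quoted resolutions can be prepared with torchvision; the crop and normalization choices below are common conventions assumed here, not necessarily the paper's preprocessing.

```python
import torchvision.transforms as T
from torchvision import datasets

# MNIST: 28x28 grayscale digits, scaled to [-1, 1].
mnist = datasets.MNIST(
    root="data", train=True, download=True,
    transform=T.Compose([T.ToTensor(), T.Normalize((0.5,), (0.5,))]),
)

# CelebA: square-crop the 178x218 originals, then downsample to 64x64.
celeba = datasets.CelebA(
    root="data", split="train", download=True,
    transform=T.Compose([
        T.CenterCrop(178),
        T.Resize(64),
        T.ToTensor(),
        T.Normalize((0.5,) * 3, (0.5,) * 3),
    ]),
)
```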
The researchers employed a simple autoencoder architecture in which the encoder is the basic five-layer discriminator backbone from DCGAN and the decoder is the DCGAN generator. Training and network hyperparameters are listed in the paper's Table 1. Figure 4 shows qualitative results from applying the model once and twice consecutively on the two datasets. As illustrated, applying IGN once (f(z)) already yields coherent generations, though artifacts can occur, such as holes in MNIST digits or distorted pixels in the scalp and hair regions of face images.
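A rough sketch of such a backbone for the 64×64 RGB setting follows; the channel widths and latent size are illustrative assumptions, and the actual hyperparameters are in the paper's Table 1.

```python
import torch.nn as nn

def dcgan_autoencoder(nc=3, nz=256, ngf=64):
    """DCGAN discriminator as encoder, DCGAN generator as decoder."""
    encoder = nn.Sequential(  # 64x64 image -> 1x1 latent, five conv layers
        nn.Conv2d(nc, ngf, 4, 2, 1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(ngf, ngf * 2, 4, 2, 1), nn.BatchNorm2d(ngf * 2), nn.LeakyReLU(0.2, True),
        nn.Conv2d(ngf * 2, ngf * 4, 4, 2, 1), nn.BatchNorm2d(ngf * 4), nn.LeakyReLU(0.2, True),
        nn.Conv2d(ngf * 4, ngf * 8, 4, 2, 1), nn.BatchNorm2d(ngf * 8), nn.LeakyReLU(0.2, True),
        nn.Conv2d(ngf * 8, nz, 4, 1, 0),
    )
    decoder = nn.Sequential(  # mirror image: 1x1 latent -> 64x64 image
        nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.BatchNorm2d(ngf), nn.ReLU(True),
        nn.ConvTranspose2d(ngf, nc, 4, 2, 1), nn.Tanh(),
    )
    return nn.Sequential(encoder, decoder)  # f maps image space to image space
```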
Reapplying f (f(f(z))) corrects these issues, filling holes or reducing the total variation around noisy patches in face images. Comparing the two applications also shows that once an image approaches the learned manifold, reapplying f produces only minimal change, since the image is already considered in-distribution. The authors further demonstrate that IGN has a consistent latent space, similar to what has been shown for GANs, with Figure 6 illustrating manipulations in that latent space.
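With a trained f, both the single-step generation and the optional second application are plain forward passes, and latent manipulations happen directly in noise space (which, for IGN, has the same shape as an image). A hypothetical example for the 64×64 setting:

```python
import torch

f = dcgan_autoencoder()          # backbone from the sketch above (untrained here)
z = torch.randn(16, 3, 64, 64)   # noise with image dimensions
x1 = f(z)                        # single-step generation, f(z)
x2 = f(x1)                       # second application, f(f(z)), cleans up artifacts

# Latent-space manipulation: interpolate between two noise inputs, then project.
z_a = torch.randn(1, 3, 64, 64)
z_b = torch.randn(1, 3, 64, 64)
x_mid = f(0.5 * z_a + 0.5 * z_b)
```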
<p class="article-content__img" style="margin-top: 0px; margin-bottom: 28px; padding: 0px; box-sizing: border-box; outline: 0px; border-width: 0px; border-style: solid; border-color: rgb(229, 231, 235); --tw-shadow: 0 0 #0000; --tw-ring-inset: var(--tw-empty, ); --tw-ring-offset-width: 0px; --tw-ring-offset-color: #fff; --tw-ring-color: rgba(41, 110, 228, 0.5); --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-shadow: 0 0 #0000; line-height: 32px; text-align: center; color: rgb(59, 59, 59); word-break: break-word; font-family: 'PingFang SC', 'Microsoft YaHei', Helvetica, 'Hiragino Sans GB', 'WenQuanYi Micro Hei', sans-serif; letter-spacing: 0.5px; display: flex; -webkit-box-align: center; align-items: center; -webkit-box-pack: center; justify-content: center; flex-direction: column; text-wrap: wrap; background-color: rgb(255, 255, 255);"><img src="https://www.cy211.cn/uploads/allimg/20231114/1-231114113215910.png" title="image.png" alt="image.png" style="margin: 0px auto; padding: 0px; box-sizing: border-box; outline: 0px; border: 1px solid rgb(238, 238, 238); --tw-shadow: 0 0 #0000; --tw-ring-inset: var(--tw-empty, ); --tw-ring-offset-width: 0px; --tw-ring-offset-color: #fff; --tw-ring-color: rgba(41, 110, 228, 0.5); --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-shadow: 0 0 #0000; max-width: 700px; background: url('../img/bglogo2.svg') center center no-repeat rgb(247, 248, 249); box-shadow: rgba(27, 95, 160, 0.1) 0px 1px 3px; display: inline-block;"/></p>
The researchers also validated IGN's potential as a 'global mapping' by feeding the model images from various distributions and generating their 'natural image' equivalents. They tested inverse tasks such as denoising a noised image (x + n) and restoring a grayscale version of an original image x, both of which are ill-posed problems. IGN produced natural mappings that conform to the structure of the original images, and applying f iteratively improves image quality further (for example, it removes dark and smoky artifacts when projecting sketches). These findings suggest that IGN is more efficient at inference, generating results in a single step once trained, and that it produces more consistent outputs, properties that may extend to broader applications such as medical image restoration.
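As a concrete illustration of that single-step inference, such projections are again just forward passes through a trained model; the corruption operators in this sketch are illustrative stand-ins for the degradations the paper tests.

```python
import torch

f = dcgan_autoencoder()                          # stand-in for a trained model
x = torch.rand(8, 3, 64, 64)                     # placeholder batch of clean images
noisy = x + 0.15 * torch.randn_like(x)           # the noised input x + n
gray = x.mean(dim=1, keepdim=True).expand_as(x)  # grayscale input, kept 3-channel
denoised = f(noisy)                              # one step back toward the manifold
colorized = f(gray)
refined = f(denoised)                            # iterating f can clean up further
```

The appeal is that one trained network covers generation, refinement, and restoration with the same forward pass.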