The DALL-E 2 AI image generator has been trending on Twitter in recent weeks. This evening, Google unveiled its own contender, Imagen, which combines an unprecedented degree of photorealism with a deep level of language understanding.
According to Jeff Dean, the head of Google AI, these systems can unleash the combined creativity of humans and computers, and Imagen is one of the directions the company is pursuing. The Google Research, Brain Team has pushed the level of realism achievable by its text-to-image diffusion model. DALL-E 2's output is generally quite realistic, although closer examination may reveal that some artistic liberties were taken. (Be sure to watch this video explainer for more.)
Imagen relies on the strength of diffusion models for high-fidelity image synthesis and builds on the power of large transformer language models for text comprehension. As the researchers put it, their key finding is that generic large language models (such as T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: scaling up the language model in Imagen improves sample fidelity and image-text alignment far more than scaling up the image diffusion model.
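Diffusion models of the kind Imagen builds on generate images by learning to reverse a gradual noising process. As a rough illustration (this is not Imagen's actual code; the schedule, shapes, and function name are hypothetical), the forward process that corrupts a training image can be sketched as:

```python
import numpy as np

def add_noise(x0, t, betas, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) under a variance schedule `betas`.

    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I),
    where alpha_bar_t is the cumulative product of (1 - beta_i).
    """
    alpha_bar = np.cumprod(1.0 - np.asarray(betas))[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# A toy 4x4 "image": more noising steps push it toward pure Gaussian noise.
x0 = np.ones((4, 4))
betas = np.linspace(1e-4, 0.02, 1000)   # hypothetical linear schedule
x_early = add_noise(x0, 10, betas)      # still close to x0
x_late = add_noise(x0, 999, betas)      # nearly pure noise
```

A neural network (in Imagen's case, conditioned on frozen T5 text embeddings) is then trained to undo this corruption step by step, which is what lets the model synthesize an image from noise given only a text prompt.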
To demonstrate this progress, Google developed the DrawBench benchmark for evaluating text-to-image models. In side-by-side comparisons against VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, human raters preferred Imagen in terms of both sample quality and image-text alignment.
DrawBench probes how well a model understands user requests along dimensions such as spatial relations, long-form language, uncommon words, and challenging prompts. Another advance is a new Efficient U-Net architecture, which converges faster and is more memory- and compute-efficient.
Without ever having trained on COCO, Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, and human raters find that Imagen samples are on par with the COCO data itself in terms of image-text alignment.
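For context, FID (Fréchet Inception Distance) compares the Gaussian statistics of Inception-network features extracted from generated versus real images; lower is better. A minimal sketch of the closed-form Fréchet distance between two Gaussians (the feature-extraction step is omitted, and the function name is my own):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical statistics give a distance of (numerically) zero.
mu, sigma = np.zeros(3), np.eye(3)
d = frechet_distance(mu, sigma, mu, sigma)
```

In practice the means and covariances are estimated from Inception features over tens of thousands of images, which is how a single score like 7.27 summarizes a model's realism on COCO.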
Citing the potential for misuse, Google opted against releasing the Imagen source code or a public demo at this time. The team explains:
Imagen uses text encoders trained on uncurated web-scale data, and it therefore inherits the social biases and limitations of large language models. As a result, there is a risk that Imagen encodes harmful stereotypes and representations. For this reason, we have decided not to release Imagen for public use without further safeguards.