Thursday, May 4, 2017

Text-to-Image Synthesis

Novel photo-real images generated by an adversarial network of computers
based solely on a written prompt, without human intervention or photo cues.
The low-resolution versions in the top row are refined into the higher-resolution versions in the bottom row.
via Olivier Grisel on Twitter
In the ten years of this blog so far we've witnessed startling advances in the ability of computers to create and interpret images.

We've seen systems that can (links to previous posts):
Re-render a photo in the style of any artist.
Identify faces and objects, no matter the lighting, angle, or context.
Render a photo in a painterly way that puts more detail in psychologically salient areas.
Paint a generalized portrait that's typical of the style of Rembrandt.
Generate captions for images describing at a higher level what's going on in a given photo.
Analyze the abstract elements of a target image and then locate other abstractly similar images.

Despite these advances, most of us human picture-makers can still pride ourselves on our unique ability to create a photo-real image based purely on a written description.

Suppose, for example, you were asked to paint a picture of "a small bird with a pink breast and crown, and black primaries and secondaries." Could you do it? And could you render your picture so believably that someone else might mistake it for a real photo?

Computer generated images courtesy

Computers are figuring this out, and they're starting to get good at it. Scientists are approaching the problem of text-to-image synthesis by means of a deep-learning technique called "generative adversarial networks" or GANs for short.

This GAN strategy pits two separate computer networks against each other. The goal of one, the Generator, is to create images that fit the text prompt, and the goal of the other, the Discriminator, is to distinguish synthetic images from real ones.

As the Generator tries to create images that fool the Discriminator, its task gets harder, because the Discriminator keeps learning, too. Exactly what the computer "knows" about the structure of form or the aptness of illustrative problem-solving is hard to say because it wasn't taught by a human; it figured it out on its own, in its own way.
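
For readers who like to see the tug-of-war in concrete terms, here is a minimal Python sketch of the two opposing goals. It assumes the textbook binary cross-entropy form of the GAN objective, which may differ from what any particular text-to-image system uses, and the function names are made up for illustration. Each function returns a "loss" that its network tries to make as small as possible.

```python
import math

def discriminator_loss(p_real, p_fake):
    """Loss the Discriminator tries to minimize (hypothetical helper).

    p_real: its estimated probability that a real photo is real.
    p_fake: its estimated probability that a generated image is real.
    """
    return -(math.log(p_real) + math.log(1.0 - p_fake))

def generator_loss(p_fake):
    """Loss the Generator tries to minimize (common non-saturating form)."""
    return -math.log(p_fake)

# Case 1: the Discriminator catches the fake (rates it 0.1 likely to be real).
# Case 2: the Discriminator is fooled (rates the same fake 0.9 likely to be real).
sharp_d = discriminator_loss(p_real=0.9, p_fake=0.1)
fooled_d = discriminator_loss(p_real=0.9, p_fake=0.9)
g_when_caught = generator_loss(p_fake=0.1)
g_when_fooling = generator_loss(p_fake=0.9)

print(sharp_d < fooled_d)              # the Discriminator does better when it catches fakes
print(g_when_fooling < g_when_caught)  # the Generator does better when it fools it
```

The two losses pull in opposite directions on the same number, p_fake, which is why any improvement in one network makes life harder for the other.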

Related video: Image Synthesis From Text With Deep Learning

The resulting images are not an average of existing photos. Rather, they're completely novel creations.

Furthermore, GAN image synthesizers can be used to create not only real-world images, but also completely original surreal images based on prompts such as: “an anthropomorphic cuckoo clock is taking a morning walk to the pastry market.”

How good are these synthetic illustrations?

So far the images are small (about 64 x 64 pixels) and, for the most part, they still won't fool any humans. But watch out: you're seeing just baby steps.

GANs currently do pretty well generating plausible pictures of birds and flowers, but they have limited success with complex scenes involving human figures, or generalized text prompts such as "a picture of a very clean living room."

The images are a bit garbled and incoherent at the moment, but the technology will develop rapidly. In a few years, advanced A.I. image-creating tools that can illustrate any text prompt in any style will be available cheaply to art buyers everywhere.

Geek out: 
• A scholarly PDF: Generative Adversarial Text to Image Synthesis
• Related scientific paper about texture synthesis: Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks
• More images at Creative AI


Kessie said...

Haha, imagine the possibilities. "Generate me a fantasy book cover with two flying dragons and a castle and a girl wizard."

Unknown said...

OK, I'm literally speechless this time. It's not April 1st right? No, it's May 4th. Wow, this one is really unreal. Technology is truly amazing. I don't even know what to say. Thanks for the info James.

Rich said...

"Unreal technology" ...
in 2nd dimension: just in a "baby stage";

"your'e seeing just baby steps"...
to quote James.

What about third dimension?
Maybe within the next decade - when our babies have grown up ;-)

James Gurney said...

Robert, Rich, and Kessie, yes, it's hard to know how to respond emotionally to this. We adjust our brains and our daily lives to so many technological miracles (Google Maps, AI, VR, driverless cars, etc.) and they come racing at us on an almost daily basis. The prospect of a powerful image synthesizer is beyond wonderful and beyond unsettling at the same time. And it's also a bit unnerving to think that these capacities have been developing, not by the intervention of ingenious human programmers, but rather by two computer systems honing their mysterious skills in their own darkness.

Artists were both threatened and strengthened by the tools that photography—and Photoshop 100 years later—brought us. I suppose we'll be threatened and strengthened by image synthesizers too. How our creative strategies will adapt is impossible to know until this new creature evolves further. I can only hope that we humans will want to embrace the hand-made, brain-made art, and that the bedrock of human ingenuity will remain untouched by artificial simulations.

Anonymous said...

It will be interesting to see how this affects (or is affected by) copyright law. There are plenty of legally gray implications of something like this. If they're eventually deriving these images from artists' or photographers' work, who gets paid?

Wendy said...