Gurney Journey: New Tools for Text-to-Image Generation

Thursday, July 8, 2021

Generating an image from a line of text entirely by means of computer algorithms has been possible for the last few years. Newly invented tools are yielding results that keep getting more interesting.

The images can be hauntingly surrealistic, such as this one, which was generated by the phrase “when the wind blows.”

Image courtesy The Big Sleep (source: @advadnoun on Twitter)

It's a little blurry and out of focus, with tendrils of downy fluff waving in dim light. It seems more like a photograph than a painting, but really it's a new category of image, made by computer software drawing from big data sets.

Lately people's imaginations have been captured by tools such as VQ-GAN and CLIP.

Prompt: “a face like an M.C. Escher drawing” from The Big Sleep (source: @advadnoun on Twitter)

Some of the results are compelling and intriguing, seemingly intelligent in a weird non-human way, as if you're looking into an alien's mind. Is that a face on its side, an eye, a nose, a mouth? Are those textures fingerprints?

Prompt: “The Yellow Smoke That Rubs Its Muzzle On The Window-Panes”

from VQ-GAN+CLIP (source: @RiversHaveWings on Twitter)

Each solution has a visual logic of theme and variation that's carried throughout the image. It's certainly not random.

Prompt: “A Series Of Tubes” from VQ-GAN+CLIP (source: @RiversHaveWings on Twitter)

Many of the images from this system have a surrealistic patchwork appearance resembling Cubism, where extracted fragments are juxtaposed across the picture plane, but the 3D space doesn't make sense as a real scene.

(source: @ak92501 on Twitter)

Some of the creativity of this enterprise derives from the odd juxtapositions of the words in the prompts. The results are often effective with long prompts. The phrase for the image above is “a small hut in a blizzard near the top of a mountain with one light turn on at dusk trending on artstation | unreal engine”

In recent weeks, people writing prompts realized you can get the system to yield a more detailed style if you say "trending on artstation."

Prompt: "matte painting of someone reading papers and burning the midnight oil | trending on artstation"

by Twitter user @ak92501
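For readers who want to experiment with this trick, here is a minimal sketch of how a prompt with style modifiers might be assembled. The helper name build_prompt is hypothetical, and the " | " separator is simply copied from the prompts quoted above; how a particular tool interprets that separator is an assumption you would need to check against the tool itself.

```python
def build_prompt(subject, modifiers=()):
    """Join a subject and optional style modifiers with ' | '.

    Hypothetical helper: the separator mirrors the prompts quoted in
    this post and is not taken from any specific tool's documentation.
    """
    # Trim whitespace and skip empty modifiers so no stray separators appear.
    parts = [subject.strip()] + [m.strip() for m in modifiers if m.strip()]
    return " | ".join(parts)


subject = "matte painting of someone reading papers and burning the midnight oil"

# The bare prompt, and the same prompt with the style modifier appended.
plain = build_prompt(subject)
styled = build_prompt(subject, ["trending on artstation"])

print(plain)
print(styled)
```

Generating an image from both versions of the same prompt is one way to see how much the added phrase changes the result.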

I expect that with time the results will be accepted alongside human efforts, beginning perhaps with categories like motel art, Twitter avatars, and corporate clip art. They will take their place on Instagram alongside painters and photographers. Many of the innovators in this field write their own code and come up with remarkably creative prompts, so it makes sense to think of them as artists.

As a viewer, I'm not quite sure how to respond emotionally to something that looks like art, but which didn't pass through a human consciousness.

As an artist, I'm not worried about my job. Maybe it's a vain hope, but I feel like people will always want to see images made by a human hand and filtered through a human brain rather than ones made by an unfeeling machine. The question is whether eventually we'll be able to tell the difference.

Thanks, Chris!

Resources to learn more:

• UC Berkeley blog post, which is a good overview of techniques: Alien Dreams: An Emerging Art Scene

• Scientific paper (Free PDF) "Taming Transformers for High-Resolution Image Synthesis"

• Twitter account "Images.AI" which plays with these natural language prompts and some of the same tools.

13 comments:

Joel Fletcher said...: These are fascinating, very cool looking images. But how I respond to them emotionally is like the Uncanny Valley effect. These pieces feel unsettling, almost like an evil impersonation of art! However I am sure this will improve in the future.; July 8, 2021 at 1:24 PM
rock995 said...: This is amazing. Now to read it again (and maybe a 3rd time); July 8, 2021 at 2:25 PM
SummaSummanum said...: Nota Bene

Adding "Unreal Engine" to your description also yields a HQ output.; July 8, 2021 at 2:51 PM
MerylAnnB said...: O. M. G. This is mind blowing. Gotta grok it. Pretty weird, and yet I love it, or at least some of them. I agree that it seems so nonhuman...LOL maybe the robots will prefer this kind of art! But I'm thinking it has possibilities in mixed media...mixed in with some human-created images.

I'm unclear - is there a link or site where you can go to try a little experimentation? Or do you have to get a whole program and fully engage?

btw that last image does not have the correct caption, according to the site it came from. That image was created from the prompt “a small hut in a blizzard near the top of a mountain with one light turn on at dusk trending on artstation | unreal engine”. (The “view from on top of a mountain where you can see a village below at night with the lights on landscape painting trending on artstation | vray” actually is for another image.)

Thanks for this very intriguing post! wow!

Merylannb; July 9, 2021 at 1:03 AM
James Gurney said...: MerylAnnB, Thanks for the correction, which I've made in the post. About trying out the software, I know that usually the scientific papers have open source code that you can try out. But I don't see it here.; July 9, 2021 at 8:26 AM
Unknown said...: @images_ai on Twitter has a link to a tutorial with colab notebook links in their pinned tweet.

Ann; July 9, 2021 at 11:10 AM
kev ferrara said...: To make narrative meaning requires narrative experience. To understand people requires human experience. To understand human emotion requires emotional experience. How do we appreciate a day, or an action if we've never had a day or acted? To a machine, what is the value of the sun on a cold day? What is the value of a moment of beauty in an ugly and depressing milieu? What can a machine say about suffering? Or joy?

Poetry is not random; it connects understanding, feeling, and reference into a narrative unit, into a single concise story. It suggests movements or changes in circumstance and tone, import and complexity. It requires a back-engineering of our abilities to appreciate, which gets to our history of appreciating in real experience; our unique experience as well as our archetypal experience.

Part of the vital charge behind the modernist wave was the toppling of the accepted principles of art. On the one hand, this opens up a wider domain of opportunity, but on the other, when people go wide, they rarely go deep. And concentration is associated with depth.

One of the principles cast aside by modernism was narrative. Also coherence, which goes hand in hand with narrative. Since narrative and coherence are necessary to produce intelligible meaning, meaning too is cast aside. But if meaning is cast aside, why bother concentrating? Randomness and whimsy will do.

It is all a downward spiral. The result is that Gibberish became acceptable as art. Many called it gibberish at the time. And they were overrun by publicity and fashion and intellectual defenses in 'sophisticated' magazines.

Since so much of what Art had done in the past, and so much of its difficulty lay in making meaning aesthetic, it was a bit like dispensing with the net in tennis. Both the game and the difficulty disappear at once.

Random connections shouldn't impress us as art. Technologically, it is a neat trick. Like a monkey on a bicycle; a brief entertainment, but not important to our lives.; July 9, 2021 at 12:09 PM
Bill Marshall said...: Such a well-written and poignant dissertation, kev! A post that could fill the room with conversations about: "What is Art in Today's Technology?".

Bill; July 11, 2021 at 12:07 AM
James Gurney said...: Kev, if I understand your comment correctly, you're suggesting that art is necessarily a human enterprise because of the way it clearly transmits one person's experience to another, and that modernism took away the ladder of coherence and narrative, leaving us with random gibberish.

Are you then associating this computer-generated art with a kind of debasement of meaning and aesthetic standards?

To me the thing that's interesting about this kind of image-making is that it lives outside of the realm of human aesthetics and art historical tradition. It doesn't carry a lance for any artistic army. You can plug in the name of an art style, but you don't have to. The results look "modern" to us because they superficially resemble cubism, and often it seems like a baby babbling, almost making sense but not quite.

But it's not hard for me to imagine that in a few years these tools will be refined to generate results that DO make perfect narrative sense, drawing on the whole sweep of art history, on photos and videos of the world, or even on direct encounters through active sensory robots. Computers could generate images of fantastical scenes in the style of J.C. Coll, or could paint lonely landscapes that look like Andrew Wyeth, or they could inhabit a novel viewpoint that we've never seen before.

I'm not promoting that idea or suggesting it would be a good thing for artists. But it's powerfully challenging for artists and art historians, for sure.; July 11, 2021 at 8:04 AM
Stephen Berry said...: Art is necessarily a two-motive process: it’s a private, personal journey, but with a public, shared result. The visually interesting work the computers here have generated is for the second half. Which is not without value. But what they can never reproduce is the experience we feel while creating.

The experience of making art is often shunted to the side when talking about art, and its value to us, but I think it really should be primary. Art should be for making. We should all be making art. I’m not worried about what final products computers (or other human artists, for that matter) can create, because no matter what is produced, no matter how compelling they are, the experience of making it, paying attention, and relating more directly with the world through the making of it, will never have been experienced by me. That’s the central experience. Or should be.; July 13, 2021 at 10:19 AM
kev ferrara said...: "But it's not hard for me to imagine that in a few years these tools will be refined to generate results that DO make perfect narrative sense..."

I don't quite have the time to unpack every question here, so I'll just make some notes...

Although I harp on this point often, it is still rarely talked about, even less well understood, and even harder to explain, but there are vast and essential differences between photos and narrative/imagistic works of art. Not just in how the pictures are manifested technically. That's small potatoes. The differences go down to the atom.

As the camera became a popular pastime, and came into constant usage among artists, and photos began to crowd out illustrations in periodicals, and then as artists began to project and trace photos (with Art Directors requiring quicker turnaround times and more photographic realism)... a fundamental change occurred. Photography began to be seen as the standard method of apprehending the visual world. Even more standard than looking at the world with one's own eyes.

But the purpose of seeing the world with your own eyes is because, well, there's more there than meets the eye. For one, the world is not still. And photographs are. And experience of the world is not still. We are not still. The greatest lie of the photograph is that there is such a thing as an instant of time. But only flow and change is true. The world is also not flat, as photographs are.

I can easily imagine AI creating pictures that make 'photographic sense', where a frozen tableau is presented, where figures wearing particular costumes or outfits stand in a particular setting doing some particular activity, during a particular month under some particular lighting and weather conditions. It may get all that 'right' factually.

But having never walked in a dewy meadow on a misty morning, unable to smell the air, or feel the chill in the air... unable to find any particular configuration of trees against the sky any more compelling or lovely than any other, unable to feel the squishiness of the ground or the wetness of soaked pant-cuff, what can the unfeeling machine say about that beautiful stroll that transmits meaning and emotion in one? How can it resonate with our experience without having that experience?

I think what makes narrative compelling is the translation of experience beyond the blatant surface facts into something others can experience. As Harvey Dunn put it, "The only thing that is true about anything is the spirit of it."

Regarding Coll and Wyeth... I can't imagine a computer replicating either style except in the most superficial way. (I don't think it is possible to compose images without utilizing appreciations of personal experiences. The foundation of metaphor is in feeling, not intellect. And computers cannot experience.) Further, I can't imagine a computer coming up with its own style, equal to any great artist in its personal and humane qualities.

It is also often unremarked upon just how real physical tools and materials allow, encourage, and even necessitate artists to develop their own personal ways of handling paper and pigment and rendering. I think digital art tools are the reverse. They are more like a funnel, guiding artists toward stylistic avenues and quirks that are the residue of decisions and understandings of the coders and software engineers who made the programs. I bet I can put together a selection of a hundred painted faces from the golden age of illustration, each done by a different artist, and you will be able to tell instantly that a different artist has done each one, and also which artist has done it. Simply by the style of rendering and drawing and thinking and feeling on the faces alone. As N.C. Wyeth said, "I worship individuality. But a style cannot be manufactured."; July 14, 2021 at 11:18 AM
James Gurney said...: Kev, I can see you've been pondering a lot lately. These are big ideas and important to hash out. We should get together for a cup of coffee sometime in Kingston.; July 14, 2021 at 1:00 PM
Notme said...: A quick comment on kev's point, on the necessity of experience and feeling to compose a metaphorical image:

I was making an experiment with AI and used the input "philosopher meets self-conscious AI". It generated a figure of a robot and what it apparently tried to make of a Greek philosopher in classical painting, with an abstract dash of the same color and pattern in each of the figures' heads, facing each other.

The detail that called my attention was that both figures had the same phenomenon at eye level.

I'm not claiming that these AIs that generate "art" are self-conscious, but I want to point out that they do make a thematic - visual and metaphorical - processing. When one says "philosopher", the AI searches for figures most associated with philosopher: how are they, visually? They're like classical paintings of bearded men, it decided. If one writes "cute style", it asks "what is cute style like"? And generates the appropriate visual output. In this second case, of a rose tinted field of hearts and storybook-like objects. It decided pink hearts were cute based on "cute style".

This means that it is capable of inferring symbols and aesthetic symbolism (more abstract stuff such as color) from diffuse keywords. And it will probably get better at that. Given that it already supplies the fundamental quality needed for a work to have meaning FOR the viewer, the rest will be a question of intensity, or more refined processing.

Now, back to that philosopher-ai image, would it be so absurd that it used symbolism, albeit relatively abstract symbolism, when it already does that with form?

Even if an AI cannot, based on experiment, understand what it generates, we, the viewers, understand what it creates.

It raises a scenario where AI generated art could be visually indistinguishable from human generated art, from the formal to the metaphorical level, because it already processes these two levels in a not-as-refined way.

(Even if its processes do not conceptually understand each).

...

And on that point of copying an artist's style, it's already possible. It's what happens when one says "in the style of artstation/van Gogh/unreal engine" and it gives the corresponding "feel" for it. That it doesn't have personality bias on representation will make it even more efficient. It's a question of refinement.; October 2, 2021 at 2:47 PM