|Caption: “A person riding a motorcycle on a dirt road.” Source: Io9|
On top of that, a computer must be able to sort out the salient features of each object and identify what it is—what category it belongs to. Even more difficult is explaining the relationship between objects—what's going on in the scene. Finally, to create a caption for an image, the computer must also translate its understanding into natural-sounding language.
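The stages just described—detect objects, categorize them, infer a relationship, then phrase it in language—can be sketched as a toy pipeline. Every function below is a hypothetical stub standing in for a trained neural network; no real captioning system works from hand-written rules like these.

```python
# Schematic sketch of the four captioning stages; all stubs are hypothetical.

def detect_objects(image):
    """Stage 1: find salient regions (stubbed as pre-labeled boxes)."""
    return image["regions"]

def classify(region):
    """Stage 2: assign each region a category (stubbed lookup)."""
    return region["label"]

def relate(labels):
    """Stage 3: infer what's going on between the objects (stubbed rule)."""
    if "person" in labels and "motorcycle" in labels:
        return ("person", "riding", "motorcycle")
    return (labels[0], "near", labels[1])

def generate_caption(triple):
    """Stage 4: render the relationship as natural-sounding language."""
    subject, verb, obj = triple
    return f"A {subject} {verb} a {obj}."

def caption(image):
    labels = [classify(r) for r in detect_objects(image)]
    return generate_caption(relate(labels))

toy_image = {"regions": [{"label": "person"}, {"label": "motorcycle"}]}
print(caption(toy_image))  # -> A person riding a motorcycle.
```

In a real system each stage is learned from data rather than coded by hand, which is exactly the top-down versus bottom-up distinction discussed below.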
|Caption: “Two pizzas sitting on top of a stove top oven.” Source: Io9|
The human's answer is better because he or she recognized that there were three different kinds of pizza, and that they were resting on a stove, not a "stove top oven."
At this stage, computers don't always get the captions right, and it's fascinating to see how they get them wrong. For example, the computer mistakenly believed the child in the knitted hat was blowing bubbles.
The long-standing problem in developing computer vision was that programmers tried to solve it top-down, hand-coding rules that told the computer what to look for. Part of the solution has been a bottom-up approach: deep learning lets the computer learn from examples and rapidly improve its performance.
Computer vision presents us with some immediate potential benefits: artificial systems will be able to help blind people, assist in manufacturing, and drive us around safely in cars.
But artificial intelligence, in its darker potential manifestations, could present an existential threat to humans, as outlined in a recent New Yorker article, "The Doomsday Invention," and in this TED talk (link to YouTube).
Computer vision on Wikipedia