|Caption: “A person riding a motorcycle on a dirt road.” Source: Io9|
On top of that, a computer must be able to sort out the salient features of each object and identify what it is—what category it belongs to. Even more difficult is explaining the relationship between objects—what's going on in the scene. Finally, to create a caption for an image, the computer must also translate its understanding into natural-sounding language.
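The stages just described—detect objects, categorize them, infer a relationship, then phrase it in language—can be sketched as a toy pipeline. Every function below is a hypothetical stub standing in for a trained neural network; no real captioning system works from hand-written rules like these.

```python
# Schematic sketch of the four captioning stages; all stubs are hypothetical.

def detect_objects(image):
    """Stage 1: find salient regions (stubbed as pre-labeled boxes)."""
    return image["regions"]

def classify(region):
    """Stage 2: assign each region a category (stubbed lookup)."""
    return region["label"]

def relate(labels):
    """Stage 3: infer what's going on between the objects (stubbed rule)."""
    if "person" in labels and "motorcycle" in labels:
        return ("person", "riding", "motorcycle")
    return (labels[0], "near", labels[1])

def generate_caption(triple):
    """Stage 4: render the relationship as natural-sounding language."""
    subject, verb, obj = triple
    return f"A {subject} {verb} a {obj}."

def caption(image):
    labels = [classify(r) for r in detect_objects(image)]
    return generate_caption(relate(labels))

toy_image = {"regions": [{"label": "person"}, {"label": "motorcycle"}]}
print(caption(toy_image))  # -> A person riding a motorcycle.
```

In a real system each stage is learned from data rather than coded by hand, which is exactly the top-down versus bottom-up distinction discussed below.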
|Caption: “Two pizzas sitting on top of a stove top oven.” Source: Io9|
The human's answer is better because he or she recognized that there were three different kinds of pizza, and that they were resting on a stove, not a "stove top oven."
At this stage, computers don't always get the captions right, and it's fascinating to see how they get them wrong. For example, the computer mistakenly believed the child in the knitted hat was blowing bubbles.
The long-standing problem in developing computer vision was that programmers tried to solve it top-down, hand-coding rules that told the computer what to look for. Part of the solution has been a bottom-up approach: deep learning lets the computer learn from examples and rapidly improve its performance.
Computer vision presents us with some immediate potential benefits: artificial systems will be able to help blind people, assist in manufacturing, and drive us around safely in cars.
But artificial intelligence, in its darker potential manifestations, could present an existential threat to humans, as outlined in a recent New Yorker article, "The Doomsday Invention," and in this TED talk (link to YouTube).
Computer vision on Wikipedia