Words and actions
Computers are now ‘learning’ to describe what’s happening in images
Comedian Steve Martin once quipped that his sentences brightened up once he started using verbs. It was a joke, of course, but describing reality is more than identifying objects; it's identifying what those objects are doing. Now, with recent advances in image recognition technology, computers may be able to describe reality as well as, or better than, humans.
Last month, researchers at Stanford University and Google, each group working independently, announced the development of artificial intelligence software capable not only of identifying objects in images but also of describing what those objects are doing. For example, an image a human described as “A group of men playing Frisbee in the park” was captioned by the computer as “A group of young people playing a game of Frisbee.”
The researchers found that the computer-generated descriptions were consistently accurate when compared with descriptions written by humans. Such a capability could lead to far more efficient cataloging and searching of billions of online images and videos. Current search engine technology relies on written descriptions of image contents supplied by humans.
How does a computer learn to recognize multiple objects in an image, what those objects are doing, and then write an accurate description, all on its own? The key is the rapidly advancing technology of artificial neural networks (ANNs): software programs inspired by the architecture of the human brain.
The Stanford and Google researchers combined two types of ANNs, one focused on image recognition and the other on natural language processing. Next, they presented thousands of different images and descriptions to the hybrid ANN, “training” it to recognize patterns in the picture/description pairs. They then gave the ANN a set of pictures it hadn’t yet “seen” to determine how well it had learned.
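For readers who want a concrete picture, the sketch below (in Python, using the PyTorch library) illustrates the general idea rather than the researchers’ actual systems: one network distills an image into a list of numbers, a second network turns that list into a sentence one word at a time, and the combined model is trained on image/caption pairs. All names, sizes, and data in the sketch are illustrative stand-ins.

```python
# A minimal sketch of an image-captioning model: a small CNN encodes the
# image, an LSTM decodes a caption. Not the Stanford or Google code;
# dimensions and data here are placeholders.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Image branch: turns an image into a single feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Language branch: predicts the caption one word at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        img_feat = self.encoder(images).unsqueeze(1)   # (batch, 1, embed_dim)
        words = self.embed(captions[:, :-1])           # caption shifted right
        inputs = torch.cat([img_feat, words], dim=1)   # image vector starts the sequence
        hidden, _ = self.lstm(inputs)
        return self.to_vocab(hidden)                   # scores over the vocabulary

# Training step sketch: show the model image/caption pairs and nudge its
# word predictions toward the human-written caption.
model = CaptionModel(vocab_size=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 64, 64)            # stand-in batch of images
captions = torch.randint(0, 1000, (4, 12))    # stand-in tokenized captions
scores = model(images, captions)
loss = loss_fn(scores.reshape(-1, 1000), captions.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice, the networks are far larger and are trained on many thousands of image/caption pairs, then evaluated, as the researchers did, on pictures the model has never seen.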
“I was amazed that even with the small amount of training data that we were able to do so well,” said Oriol Vinyals, a Google computer scientist and a member of the Google Brain project, to The New York Times. “The field is just starting, and we will see a lot of increases.”
But while the technology is considered “artificial intelligence,” it’s still merely highly sophisticated pattern recognition. “I don’t know that I would say this is ‘understanding’ in the sense we want,” said IBM researcher John R. Smith in the Times article. “I think even the ability to generate language here is very limited.”
Play stations
New York City still has 6,400 coin-operated pay phones, most of which languish, unused. But that may be about to change. Next year, the city embarks on an elaborate plan to replace its aging and outdated pay phones with a network of 10,000 kiosks, each offering free high-speed wireless internet within a 150-foot radius, free phone calls within the United States, a free phone-charging station, and a touchscreen tablet for directions and city services.
The new system, called “LinkNYC,” is a partnership between the mayor’s office and a consortium of technology, manufacturing, and advertising companies. Although the project will cost about $200 million, officials expect advertising revenue from the kiosks to generate more than $500 million over the next 12 years.
Most New Yorkers won’t miss the old-school pay phones. “I’m cool with it,” Miriam Dumlao, an East Village–based musician, told The Wall Street Journal. “Like, I don’t miss my tape player.” —M.C.