COMPUTER VISION:

Eyes for the Future of AI

CHRIS QUIRK

Aside from the corporate decals plastered on the sides, the unassuming 1990 Pontiac Trans Sport — alias NavLab 5 — could just as easily be filled with kids on their way to soccer practice. But inside, along with a jumble of cables and an open laptop beside the driver’s seat, a front-facing camera lodged on the rearview mirror shuttles images of the road ahead to as much computer as the 12-volt cigarette lighter can power.

CMU research scientist Dean Pomerleau (right) and (then) graduate student Todd Jochem in front of NavLab 5.

Noted car guy and former late night TV talk show host Jay Leno (left) receives a tour of NavLab 5 outside the Tonight Show studios.

The year was 1995, and the mission was No Hands Across America, a long-distance autonomous driving test in which NavLab 5 would transport CMU research scientist Dean Pomerleau and graduate student Todd Jochem the 3,000 miles from Pittsburgh to San Diego without a human so much as touching the steering wheel. Or so they hoped. “I told my students, ‘You will share your fate with the software, so you’d better make it good!’” said Takeo Kanade, the Founders University Professor of Computer Science, who was the faculty lead on the project.

The term computer vision covers everything from image classification to facial identification to medical imaging to self-driving vehicles. Martial Hebert, dean of the School of Computer Science and a computer vision pioneer who developed the range sensor for NavLab, defines the diverse range of applications with elegant simplicity. “Computer vision is fundamentally the idea of extracting higher level information from visual data.”

As it turned out, during the No Hands Across America drive, NavLab 5 achieved a 98.2% autonomous driving rate. The trek made national news, and even caught the eye of car enthusiast Jay Leno, who invited Pomerleau and Jochem to stop by the Tonight Show studio lot for a visit. Kanade surmised at the time that full autonomous driving would take three years. “It took 30,” he said. “I was only off by an order of magnitude.”

Despite hopeful beginnings followed by — at times — glacial progress, researchers in computer vision are now probing superhuman possibilities that are expanding the fields of transportation, health, security and more. “We want to change the definition of what a camera is,” said Srinivasa Narasimhan, a professor in the Robotics Institute.

Martial Hebert, Dean of SCS

Srinivasa Narasimhan, Professor in the Robotics Institute.

Today, as driverless vehicles free of steering wheels or other manual controls are about to hit the streets, the early days of computer vision seem as distant as dial-up internet. But it bears remembering the herculean labor that was required to achieve even the most elementary tasks of computer vision, and how tomorrow’s possibilities rely on the foundational work of Hebert, Kanade and others.

Kanade and his colleagues began NavLab in true garage-project style — they attached a camera and some hardware to a cart. “There wasn’t even a laptop on board, so the robot cart couldn’t operate on its own. Can you imagine?” Kanade said. A nearby computer sent maneuvering information to the cart using a radio signal, for which the NavLab team had to acquire a broadcast license. “It was tedious,” Kanade recalled. “Initially, we moved the cart one centimeter per second. You could barely tell it was moving.”

By any measure, Kanade ranks as one of the groundbreaking figures in computer vision. He built his first major invention, a facial recognition tool, in 1970. To train the system, he took the camera and image digitization system he had built to the Japan World Exposition in Osaka, where visitors would sit for the 10 seconds it took to digitize their facial image in exchange for a small prize. “We gathered a thousand images,” Kanade said. “At the time, it was probably the world’s biggest image database.”

EyeVision in Super Bowl XXXV

In 2001, CBS Sports brought Kanade and his team at Carnegie Mellon to Super Bowl XXXV to create a new video replay system. Kanade created EyeVision, which showed viewers action on the field from virtually any angle, rotating around a play to reveal details that single point-of-view cameras could not capture. “There was no real secret to it,” said Kanade. “We had seen that effect in the movie ‘The Matrix,’ and CBS came to me and asked if I could do something similar. The big difference from the movie was that instead of a single position where an actor would stand, we had to cover the whole football field.”

Kanade first set 33 cameras around the stadium, spanning 270 degrees of view. While the principle of the system seemed self-evident to Kanade, the execution was another story. Kanade built a massive computer processing apparatus that swallowed the input from all the cameras — including information on zoom, angle and image content — and synthesized it into a single, seamless, rotating image for viewers. A camera operator, seated at a custom-built mock camera carriage and watching the game on a video monitor, controlled all the EyeVision cameras in tandem by following the action on screen. On game day, it didn’t take long before the head producer was turning to EyeVision replays again and again. Nearly 85 million viewers watched the game nationwide, and today almost all major sports use replay systems emulating EyeVision’s innovation for broadcast or replay reviews.
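
To make the “cameras in tandem” idea concrete, here is a minimal sketch of the geometry involved: given the stadium positions of many cameras and a single target point on the field, compute the pan and tilt each camera needs so that all of them converge on the same play. The coordinates, camera layout and function names are illustrative assumptions, not the actual CBS/CMU rig.

```python
import math

def pan_tilt(camera_xyz, target_xyz):
    """Return (pan, tilt) in degrees for a camera at camera_xyz to aim at target_xyz."""
    dx = target_xyz[0] - camera_xyz[0]
    dy = target_xyz[1] - camera_xyz[1]
    dz = target_xyz[2] - camera_xyz[2]
    pan = math.degrees(math.atan2(dy, dx))                    # heading in the ground plane
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))   # elevation (negative = looking down)
    return pan, tilt

if __name__ == "__main__":
    # 33 hypothetical cameras spread over 270 degrees at a 40 m radius, 20 m above the field.
    cameras = [(40 * math.cos(math.radians(-135 + 270 * i / 32)),
                40 * math.sin(math.radians(-135 + 270 * i / 32)),
                20.0) for i in range(33)]
    target = (10.0, 5.0, 1.0)  # a point on the field where the play is happening
    for i, cam in enumerate(cameras[:3]):  # show the first few cameras
        pan, tilt = pan_tilt(cam, target)
        print(f"camera {i}: pan={pan:.1f} deg, tilt={tilt:.1f} deg")
```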

Takeo Kanade, shown here in Japan receiving the 2016 Kyoto Prize for Advanced Technology, one of the many awards and honors he has received over his career. The international award is presented by the Inamori Foundation to individuals for significant contributions to the scientific, cultural and spiritual betterment of humankind.

Facial Recognition

A lack of computational power limited computer vision research early on. Now, more capable hardware drives more capable vision systems. “We’ve seen an exponential rate of advances,” said Hebert. “Another development is that over time, researchers have created a set of tools and building blocks for the creation of more complex functionality. You don’t have to reinvent the wheel every time.” Much of Hebert’s early research centered on finding ways to economize information and processing so that the scarce computational capacity of early computer vision systems could be used to the fullest. “Techniques today are very different,” said Hebert. “But the early research helped identify key challenges of computer vision.”

These advances now empower researchers to find new ways to explore the visual world. László Jeni, assistant research professor in the Robotics Institute, has created a face interpretation tool to help people with vision impairment recognize those around them and interact with them more easily. The tool consists of a small, head-mounted camera and earpiece that the user wears, along with a processing unit on the arm. As someone approaches, the tool scans the person’s face and analyzes their expression, whether a smile or a look of consternation. It then gives a simple audio cue about the person’s emotional state, such as, “Nick is approaching, and looks very happy,” so the wearer can greet Nick and start a conversation. “It recognizes what we call facial action units. These are elementary facial characteristics, like raising your eyebrows, a smile or a smirk, and each has a separate code,” Jeni explained. “Humans use a lot of different communication channels, and verbal is just one of them.”
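
As a rough illustration of how detected action units might become a spoken cue, here is a minimal sketch. The action unit codes follow the standard Facial Action Coding System, but the thresholds, emotion rules and function names are illustrative assumptions rather than Jeni’s actual pipeline.

```python
from typing import Dict

# A few standard FACS action units and their informal meanings.
AU_NAMES = {
    4: "brow lowerer",       # often associated with consternation
    6: "cheek raiser",       # part of a genuine (Duchenne) smile
    12: "lip corner puller", # the basic smile
}

def interpret_expression(au_intensity: Dict[int, float]) -> str:
    """Turn per-action-unit intensities (0..1) into a coarse emotional label."""
    smile = min(au_intensity.get(6, 0.0), au_intensity.get(12, 0.0))
    frown = au_intensity.get(4, 0.0)
    if smile > 0.5:
        return "looks very happy"
    if frown > 0.5:
        return "looks concerned"
    return "looks neutral"

def audio_cue(name: str, au_intensity: Dict[int, float]) -> str:
    """Compose the short sentence that would be read to the wearer."""
    return f"{name} is approaching, and {interpret_expression(au_intensity)}."

if __name__ == "__main__":
    # Example: strong AU6 + AU12 activation, i.e. a clear smile.
    print(audio_cue("Nick", {6: 0.8, 12: 0.9, 4: 0.1}))
```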

In addition, Jeni uses computer vision face mapping analysis as part of a treatment for obsessive-compulsive disorder in patients who have not responded to medication or cognitive behavioral therapy. The therapeutic device employs deep brain stimulation via an electrode embedded in the subcortical area of the brain. The stimulation can help reduce anxiety and distress associated with OCD. The electrode stimulates the targeted brain region while a computerized face mapping tool looks at the expression of the patient and analyzes the emotional response. “The face and body responses are the motor outputs of the brain region you are stimulating,” said Jeni. “You can get a semantic meaning from that and tell whether the treatment is working. It’s very important to objectively measure behavior in order to improve outcomes. The face mapping analysis provides an interpretation of what is happening inside the patient that I can use to evaluate the deep brain stimulation.”

László Jeni

Top: A vision-impaired user wearing a head-mounted camera and a processing unit attached to her arm interacts with another person. Bottom: A frame from the corresponding processed video stream, featuring 3D face tracking, face recognition, and automated facial affect recognition of the interacting individual.

Tartan Racing Wins DARPA Challenge, Establishes Autonomous Legacy

CMU’s Tartan Racing team’s victory in the 2007 Defense Advanced Research Projects Agency (DARPA) Urban Challenge was a pivotal event in the development of autonomous vehicles, the repercussions of which we all feel to this day. CMU’s advanced use of computer vision helped Tartan Racing, led by William “Red” Whittaker, Faculty Emeritus, to victory and showcased the capabilities of autonomous vehicles in complex, real-world scenarios.

DARPA organized the competition to encourage the development of autonomous vehicles capable of navigating a complex urban course. The goal was to advance the state of the art in autonomous robotics and promote technologies with potential military applications.

Whittaker’s leadership and CMU’s innovative use of computer vision technologies were instrumental in the victory. Equipped with an advanced array of sensors, including cameras, lidar and radar, the autonomous vehicle “Boss” maintained a comprehensive view of its environment, a significant advantage over the competition. Tartan Racing’s computer vision algorithms processed the data from these sensors, allowing Boss to recognize obstacles, navigate the course and make real-time decisions to avoid collisions.

The success in the DARPA Urban Challenge has had a lasting impact on the field of robotics and autonomous systems, establishing Pittsburgh as a hub for further research and development.

William "Red" Whittaker

Seeing Around and Through

Computer vision researchers now work to see the unseeable — seeing around corners, clarifying images taken amidst impenetrable murk, and using visual information to record sound. Some of these innovations call to mind science fiction.

Narasimhan and Ioannis Gkioulekas, assistant professor of computer science, are creating a photon selection device to register images in foggy or near-opaque conditions. “My lab is about seeing through things,” said Narasimhan. In an early project, Narasimhan built a smart headlight that helps drivers navigate rain and snowstorms by detecting the motion of drops or snowflakes and selectively reducing headlight illumination to decrease glare. More recently, to capture images in murky underwater environments, the duo grabs the photons best suited to create a picture of the target object. “In muddy water, for example, 99.9% of the photons are just scattered,” Narasimhan said. “To get the right photons, we select particular photons that are specularly reflected, according to Fermat’s principle. You look at where the photon is coming from, the time of flight, time of arrival, even thermal information. We are very much interested in using light, sound and heat together.”
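
One simple way to picture photon selection is time gating: keep only the photons whose round-trip time matches a direct path to the target and discard the late arrivals that have been scattered many times. The sketch below is an illustrative assumption about how such a gate might work, not the actual CMU device; the distances, gate width and names are made up for the example.

```python
import numpy as np

C = 3e8  # speed of light, m/s

def gate_photons(arrival_times_s, target_distance_m, gate_width_s=100e-12):
    """Boolean mask selecting photons consistent with a direct round trip
    to a target at target_distance_m."""
    expected_tof = 2.0 * target_distance_m / C          # round-trip time of flight
    return np.abs(arrival_times_s - expected_tof) <= gate_width_s / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulate a target 1.5 m away: a handful of direct photons near the
    # expected time of flight, buried in many late, multiply scattered photons.
    direct = rng.normal(2 * 1.5 / C, 20e-12, size=10)
    scattered = 2 * 1.5 / C + rng.exponential(5e-9, size=1000)
    times = np.concatenate([direct, scattered])
    mask = gate_photons(times, target_distance_m=1.5)
    print(f"kept {mask.sum()} of {times.size} photons")
```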

Walls are no obstacle for Gkioulekas and company. Expanding on research begun at the Massachusetts Institute of Technology, Gkioulekas developed a camera that reconstructs the picture of a subject behind a separating barrier, using light bounced off an adjacent wall. Using a technique similar to lidar, the camera fires a beam of light off the wall, which then bounces off an object behind the barrier and returns. The camera records a timestamp for each returning pulse, measuring how long the light took to make the round trip. These intervals are measured in picoseconds, or trillionths of a second. “You shoot maybe a thousand pulses at a time, and go point-by-point to capture depth,” said Gkioulekas. “When you process that you get a very detailed, 3D reconstruction. We’re still working on this, but we’ve done imaging where you can read the word ‘Liberty’ on a quarter from around a corner.”
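
The timing arithmetic is worth spelling out: light covers roughly 0.3 millimeters per picosecond, so picosecond timestamps translate into millimeter-scale path lengths. The sketch below works through that conversion under an assumed geometry; the distances from the laser and sensor to the relay wall are made-up numbers, not the actual experiment.

```python
C_MM_PER_PS = 0.299792458  # light travels about 0.3 mm per picosecond

def hidden_distance_mm(round_trip_ps, laser_to_wall_mm, wall_to_sensor_mm):
    """The full path is laser -> wall -> hidden object -> wall -> sensor.
    Subtract the two known wall legs, then halve what remains to get the
    one-way distance from the relay wall to the hidden object."""
    total_mm = round_trip_ps * C_MM_PER_PS
    hidden_round_trip_mm = total_mm - laser_to_wall_mm - wall_to_sensor_mm
    return hidden_round_trip_mm / 2.0

if __name__ == "__main__":
    # Example: a 6,000 ps round trip, with the relay wall 500 mm from both
    # the laser and the sensor, puts the hidden object about 400 mm away.
    d = hidden_distance_mm(round_trip_ps=6000, laser_to_wall_mm=500, wall_to_sensor_mm=500)
    print(f"hidden object is roughly {d:.0f} mm behind the corner")
```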

For an example of sound and vision working together in computer vision, Narasimhan and his colleagues are developing a dual-shutter camera that senses vibration. The camera picks up the imperceptible effect of sound vibrations on surfaces and separates the individual sources. “A microphone will capture all the sound that reaches it. If there are two people talking, or people playing drums or musical instruments, you get the sum of their sounds,” Narasimhan explained. “But we are designing this camera to be a microphone by imaging tiny vibrations invisible to the naked eye or ordinary cameras, and we can measure these extremely tiny vibrations.” It’s the kind of system that could, for example, record a rock show at a club and, from the morass of sound, separate each individual instrument into its own audio track for mixing. “For the first time, we were able to capture sound from different directions with a single camera, a single imaging system,” Narasimhan said.
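
To show the flavor of turning video into sound, here is a minimal sketch: treat the tiny frame-to-frame intensity fluctuations of a surface patch as an audio signal, one signal per patch, so that different sources in the scene land in different tracks. The synthetic input and names below are illustrative assumptions, not the dual-shutter system itself.

```python
import numpy as np

def patch_to_audio(frames, rows, cols):
    """Average a small image patch in every frame, then remove the DC level
    and normalize, leaving only the vibration-induced fluctuation."""
    signal = frames[:, rows, cols].mean(axis=(1, 2))
    signal = signal - signal.mean()
    peak = np.abs(signal).max()
    return signal / peak if peak > 0 else signal

if __name__ == "__main__":
    # Synthetic stand-in for a high-speed video: two halves of the frame
    # vibrating at different frequencies (two "instruments"), plus noise.
    fps, seconds = 2000, 1
    t = np.arange(fps * seconds) / fps
    frames = np.random.normal(0.0, 0.01, (t.size, 32, 64))
    frames[:, :, :32] += 0.05 * np.sin(2 * np.pi * 220 * t)[:, None, None]  # source A
    frames[:, :, 32:] += 0.05 * np.sin(2 * np.pi * 330 * t)[:, None, None]  # source B
    track_a = patch_to_audio(frames, slice(0, 32), slice(0, 32))
    track_b = patch_to_audio(frames, slice(0, 32), slice(32, 64))
    print(track_a.shape, track_b.shape)  # two separate 1-D audio tracks
```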

Despite the myriad applications already in development, Hebert said this is just the beginning. “There’s a whole universe of other possibilities, in manufacturing, materials science and medicine, where we could apply some of the techniques we’ve developed in traditional computer vision to move things forward.” Jeni sees digital health in particular as an area with enormous potential. “I’d like to see this technology really make a difference in clinical outcomes and change people’s lives,” he said. “I think we’re getting there. The technology works great in lab conditions, but we need to take it out into the real world.”

“At Carnegie Mellon, we have a long history of work on every aspect of computer vision: theory, software and hardware,” said Kanade. “There are many pieces of specialized hardware that we’ve designed and built, and a lot of them have become very popular, like 3D cameras that use stereo vision to analyze depth in real time. We built that in the ‘90s,” he recounted. “We have a great record of innovation, and it’s not an overstatement to say that on computer vision research going back 30 or 40 years, we are the largest, most advanced computer vision institution.”

Narasimhan echoed that sense of pushing forward while building on the foundations laid by those who came before. “What we are about now,” he said, “is creating the cameras of the future.” ■

Ioannis Gkioulekas (pictured) and Srinivasa Narasimhan are creating a photon selection device to essentially see through objects.