From Dreams to Zooms
KEVIN O’CONNELL
The Journey of Alex Waibel and the Rise of Spoken Language Translation
As we were about to begin our interview, Alex Waibel placed two phones on the table — his own, and one to send a recording to a student who would perform a different type of language translation process.
He also set his laptop on the table, opened a Zoom meeting with no online participants and set it to record just the two of us. Waibel helped invent the translation software in Zoom, so what better tool to use for our interview?
We have entered a new era of natural language recognition and processing. But how did we get here?
An Overnight Success Story 45 Years in the Making
When a new technology bursts onto the scene and seems to change our lives overnight, it’s often the culmination of dedication and years of hard work by countless people.
“Something becomes revolutionary when it reaches a level of performance and ease of use, when it’s suddenly in the hands of everyone,” said Waibel.
Spoken language translation now surrounds us, from the instant transcription and translation of Zoom meetings to dubbed videos that make someone appear to speak whatever language you prefer. With new technologies come serious questions: how do we use them for their intended purpose, and how do we guard against their misappropriation for nefarious ends?
Foundations of the Language Technologies Institute
Jaime Carbonell, Founder and Director of CMU’s Language Technologies Institute
The foundations of language technologies at CMU date back to initial work on speech recognition by Raj Reddy, the Moza Bint Nasser University Professor of Computer Science and Robotics and former dean of SCS. Reddy worked on the Harpy and Hearsay speech recognition systems for DARPA’s early Speech Understanding Research program. Harpy could transcribe sentences drawn from a vocabulary of around 1,100 words with a restricted syntax (roughly the vocabulary of a typical three-year-old), but it demonstrated that continuous transcription of speech (converting speech into text) was indeed possible.
Building on a CMU predecessor, the Dragon system, Harpy searched the space of possible syntactic and acoustic paths to transcribe a spoken sentence. While Hearsay was largely knowledge-based, Harpy used a heuristic beam search algorithm, which increased both the speed and the accuracy of recognition.
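To give a flavor of the idea, here is a minimal, hypothetical sketch of beam search in Python. It is not Harpy’s actual implementation; the toy word lattice and scores are invented. The point is that at each step only a fixed number of the best-scoring partial hypotheses survive, trading exhaustive search for speed.

```python
# A toy beam search over a made-up word lattice. The scores are invented
# log probabilities; real recognizers score hypotheses with acoustic and
# language models.

def beam_search(steps, beam_width=2):
    """steps: one {word: log_prob} dict per time step."""
    beams = [([], 0.0)]  # (partial word sequence, cumulative log score)
    for candidates in steps:
        expanded = [
            (seq + [word], score + logp)
            for seq, score in beams
            for word, logp in candidates.items()
        ]
        # Prune: keep only the beam_width best partial hypotheses rather
        # than expanding every possible path.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return max(beams, key=lambda b: b[1])

if __name__ == "__main__":
    lattice = [
        {"please": -0.2, "freeze": -1.6},
        {"show": -0.3, "chose": -1.2},
        {"me": -0.1, "meat": -2.0},
    ]
    words, score = beam_search(lattice)
    print(" ".join(words), round(score, 2))  # -> please show me -0.6
```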
After early attempts to translate text from one language to another by machine had been abandoned in the ‘60s, Jaime Carbonell, the Allen Newell Professor of Computer Science, decided to try again in the 1980s. He founded the Center for Machine Translation in SCS in 1985, which became the Language Technologies Institute (LTI) in 1996. Carbonell directed the LTI until his death in 2020. Since its founding, the LTI has been the largest and most renowned entity of its kind, leading the field in natural language processing, question-answering systems, and speech recognition and synthesis.
Throughout his career, Waibel has followed a path similar to Carbonell’s. Both worked to build machines that would allow anyone on the planet to communicate with anyone else without the barrier of language.
“When I started, that was crazy,” said Waibel. “Early on, people just shook their heads. So, it took 40 years, but you know, here we are. It’s actually possible. And it’s amazing how good it is.”
Waibel admits that although it may feel as though a switch flips and the world changes overnight, the reality of technological development is far more incremental. It takes years and years of effort, detours and redirections by many people to achieve the incremental improvements that add up to transformative technology.
The Journey
Growing up in Barcelona and traveling the world, Waibel often faced language issues and awkward cultural situations. When he came to the U.S. to study at the Massachusetts Institute of Technology in 1976, he was interested in space and quantum physics, but he also had an engineering mind and wanted to combine his scientific curiosity with his desire to solve hard problems from a humanitarian perspective.
“You want to leave this planet with something that touches people’s lives,” he said.
At MIT, he found the human connection he sought in a group working on speech synthesis, with the goal of building machines that could read unrestricted texts. All in a technological world nearly unimaginable to today’s SCS undergraduate students. No email. No smart phones. No Google. No screens of any kind.
“In my first programming course we used punch cards,” Waibel said.
In his final year as an undergraduate, he told his advisor that he wanted to build a machine that you speak into and that could translate the speech and repeat it back in another language. The advisor gave Waibel a look suggesting his idea might be a little farfetched.
“He was, luckily, very polite,” said Waibel, “and told me ‘That sounds like a wonderful idea, Alex.’” And so, the dream was born. It was clear to Waibel that to realize the dream would require solving three problems: turning speech into text, translating that text from one language to another, and then speaking the text in another language. “All we had was the last component (synthesis). As it turned out that was the easiest of the three and I vastly underestimated the difficulty of the others,” said Waibel.
When it came time to decide what to do next for graduate studies, Waibel heard about Carnegie Mellon, where some of the world’s best researchers were working on speech recognition (the first speech-to-text component). This was about the same time Reddy was working on the Harpy Speech Recognition System.
“I visited and said [to Reddy] I’m here from MIT. And he said, ‘Can you stay here right away and start working now?’ And I said, ‘Well, let me pack, but Yes, I could.’”
At the time, no one was working on machine translation, and few people believed that spoken language translation (a much harder problem) was even possible. Waibel’s dream of building a communication machine, a mediator that could help people talk to one another across languages, required solutions to all three components.
So, short of having a solution for machine translation, Waibel began work on the first problem: turning speech into text. But which approach to take? It became clear to Waibel that the biggest problem was ambiguity: people say the same things in very different ways.
Reddy and colleagues, in their work on early systems, had first attempted knowledge modules. The knowledge-based approach meant trying to describe speech entirely by rules, but acoustic and noise variations kept getting in the way. The complexities seemed too daunting, and Waibel decided the problem could not be solved by rules.
Other early work by Herb Simon and Allen Newell suggested solving recognition via search. But how do you represent the many levels of human speech (acoustics, syntax, semantics)? Despite early successes with Harpy and Dragon, there were still too many ways to say things — each with their own nuance and meaning — and it took too much design and modeling.
“We couldn’t program it, we needed something that learns like humans learn,” Waibel said. “A human can learn to speak but cannot explain how they do it. They just learn it.” [Editor’s note: In the print version of The LINK, this quote was misattributed to Raj Reddy. We apologize for the error.]
In the early ‘80s, however, machine learning was still in its infancy.
The next attempt was a statistical approach. Waibel worked with Kai-Fu Lee, and together they began learning the statistics of speech and language. But to Waibel, the statistical approach still seemed too shallow to be the answer. “I felt there needed to be a way of learning the abstract knowledge within it, without explicit modeling,” said Waibel.
It was at this time that Geoffrey Hinton came to Carnegie Mellon and introduced Waibel to neural networks and backpropagation. Waibel had tinkered with perceptrons before, but backpropagation training could adjust a network’s internal parameters by calculating the degree of the errors it makes, enabling the model to improve its predictions over time. Remarkably, such a simple algorithm could also learn internal abstractions (knowledge not specifically modeled) that could then be used, transferred and combined without explicit programming.
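As a rough illustration of that training loop, here is a toy sketch, not the networks of that era: a tiny two-layer Python network learns the XOR function by propagating its output error backward and nudging every internal weight. The network size, data and learning rate are arbitrary choices for the example.

```python
# Toy backpropagation on XOR using only the standard library.
import math
import random

random.seed(0)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 1, 1, 0]

H = 4                                   # hidden units
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0
lr = 0.5
sig = lambda z: 1.0 / (1.0 + math.exp(-z))

for _ in range(10000):
    for x, y in zip(X, Y):
        # Forward pass: compute hidden activations and the output.
        h = [sig(sum(w1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(H)]
        out = sig(sum(w2[j] * h[j] for j in range(H)) + b2)
        # Backward pass: measure the error and push corrections back
        # through the network, adjusting every weight a little.
        d_out = (out - y) * out * (1 - out)
        for j in range(H):
            d_h = d_out * w2[j] * h[j] * (1 - h[j])
            w2[j] -= lr * d_out * h[j]
            b1[j] -= lr * d_h
            for i in range(2):
                w1[j][i] -= lr * d_h * x[i]
        b2 -= lr * d_out

# After training, the outputs should have moved toward the targets 0, 1, 1, 0.
for x, y in zip(X, Y):
    h = [sig(sum(w1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(H)]
    print(x, y, round(sig(sum(w2[j] * h[j] for j in range(H)) + b2), 2))
```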
The only piece missing was machine translation. Waibel admits that here, he was fortunate once again. In 1986, a researcher visiting CMU from Japan told Waibel that a large laboratory called the Advanced Telecommunications Research Institute (ATR) was opening in Japan to spearhead work on machine translation. Japan had funding for innovative projects at that time, and given the communication problems between Japanese and English, it was time to try again, Waibel noted.
Kai-Fu Lee, former Assistant Professor in CSD and current CEO of Sinovation Ventures
Geoffrey Hinton, Turing Award winner and former Associate Professor in CSD
Raj Reddy and Harpy
Among the myriad foundational developments for which Raj Reddy laid the groundwork was his work on natural speech processing. The Harpy speech recognition system, pioneered by Reddy, the Moza Bint Nasser University Professor of Computer Science, revolutionized the field. Throughout the late 1970s and early 1980s, Reddy and his team set out to develop a speech recognition system capable of recognizing continuous, natural language input. The result was Harpy, a research project whose influence on the field of speech recognition can still be felt today.
The Harpy system was a precursor of hidden Markov models (HMMs), which went on to become a foundational technique in many later speech recognition systems and in their productization in industry.
Harpy’s success was due in part to the compilation of acoustic and language models and to an efficient search that considered the most likely hypotheses in context. This made it both more accurate and faster.
The Harpy system utilized acoustic modeling, or models of speech sounds. More sophisticated trainable acoustic models, such as those in the subsequent Sphinx system, evolved from concepts developed in Harpy.
The decoding algorithms and search strategies employed by the Harpy system for finding the most likely word sequences in speech data have influenced the development of modern decoding algorithms used in large-vocabulary continuous speech recognition (LVCSR) systems.
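As an illustration of the decoding idea, the sketch below runs the classic Viterbi algorithm over a toy hidden Markov model. The states, observations and probabilities are invented for the example; real recognizers apply the same principle to phone and word models learned from data.

```python
# A minimal Viterbi decoder over a toy hidden Markov model.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the most likely state path ending in s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace back the most likely state sequence.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

if __name__ == "__main__":
    states = ["S", "IH", "K"]           # hypothetical phone states
    obs = ["f1", "f2", "f2", "f3"]      # hypothetical acoustic frames
    start = {"S": 0.8, "IH": 0.1, "K": 0.1}
    trans = {"S": {"S": 0.5, "IH": 0.4, "K": 0.1},
             "IH": {"S": 0.1, "IH": 0.5, "K": 0.4},
             "K": {"S": 0.1, "IH": 0.1, "K": 0.8}}
    emit = {"S": {"f1": 0.7, "f2": 0.2, "f3": 0.1},
            "IH": {"f1": 0.1, "f2": 0.7, "f3": 0.2},
            "K": {"f1": 0.1, "f2": 0.2, "f3": 0.7}}
    print(viterbi(obs, states, start, trans, emit))  # -> ['S', 'IH', 'IH', 'K']
```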
The Harpy system was able to adapt to different speakers by making use of training techniques, laying the groundwork for speaker adaptation and personalized voice recognition in later systems. After Harpy, companies like IBM, Microsoft, Google and Apple saw tremendous potential and invested heavily in speech recognition technology, creating many state-of-the-art systems that are in common use today.
Reddy’s work demonstrated the feasibility of automatic speech recognition. Harpy could understand and transcribe spoken language, paving the way for subsequent developments that brought speech recognition into everyday applications and made it an integral part of human-computer interaction.
“That’s it,” he thought. “Here was my chance to actually work on the third element of the vision, machine translation, and to expand practical techniques for neural learning.”
In continued partnership with Geoffrey Hinton, Waibel proceeded to develop the time-delay neural network (TDNN), a network able to classify a pattern in a shift-invariant fashion, solving the long-standing problem of not knowing where in the signal a pattern will occur. This proved to be key because natural speech flows continuously and the precise location of sounds is not always known. The break from standard neural network classifiers was crucial to handling continuous speech (and, later, vision problems, as “convolutional neural networks”).
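The shift-invariance idea can be sketched in a few lines: one shared window of weights is slid across the input frames, so the same detector responds wherever the sound occurs. The filter, signals and threshold below are invented purely for illustration and are not the original TDNN architecture.

```python
# A minimal illustration of shift-invariant pattern detection.

def tdnn_layer(frames, weights):
    """Apply one shared window of weights at every time shift (a 1-D convolution)."""
    k = len(weights)
    return [
        sum(w * x for w, x in zip(weights, frames[t:t + k]))
        for t in range(len(frames) - k + 1)
    ]

def detect(frames, weights, threshold=5.0):
    """Return the time shifts where the shared detector responds strongly."""
    return [t for t, r in enumerate(tdnn_layer(frames, weights)) if r > threshold]

if __name__ == "__main__":
    detector = [1.0, 2.0, 1.0]          # one set of weights, reused at every shift
    early = [0, 1, 2, 1, 0, 0, 0, 0]    # the "sound" appears near the start
    late = [0, 0, 0, 0, 0, 1, 2, 1]     # the same "sound" appears near the end
    print(detect(early, detector))      # [1]  -- found early
    print(detect(late, detector))       # [5]  -- found late, by the same detector
```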
But this was still the late 1980s, and with the computing power available at the time, large-scale neural networks could not yet be trained, so neural networks couldn’t perform much better than the statistical systems of the day.
“We were always stuck by not being able to compute more,” said Waibel.
It wasn’t until the improved computing power of the 2010s and more data from the internet that large “deep” multilayer neural networks showed their remarkable advantage. “We now know if you stack many ‘deep’ neural network layers, they get better … and you get better results,” said Waibel. Using much the same models that Waibel and his colleagues had worked with years earlier, neural speech systems suddenly began to show dramatic performance improvements of up to 30%, reaching levels equivalent to humans. Everyone was surprised and intrigued. “Now, there was no going back,” Waibel said, “and neural nets have finally taken the world of language processing by storm.”
After returning to CMU from Japan in 1989, Waibel started a neural network speech group. By combining knowledge-based machine translation techniques, neural speech systems and speech synthesis, the team developed its first working speech translation system in 1991, known as the JANUS speech translation system.
CMU held a press conference, linking labs in Germany with colleagues in Japan, and presented the first video conference with speech translation, in which spoken English was translated into spoken Japanese and German. There was widespread media interest, including coverage by The New York Times and CNN.
“It was a bit of a sensation,” said Waibel. “But we could only do 500 words, it was slow, and you had to speak well-formed sentences.” In reality, human speech is disfluent and fragmentary, and people rarely stick to a single domain or vocabulary. If the domain was conference registration, you couldn’t talk about astrophysics. So further work was needed on transcribing and translating spontaneous, domain-unlimited speech.
The proof came in being able to translate an entire lecture.
In 2005, Waibel and his colleagues successfully demonstrated the first unrestricted-vocabulary, open-domain, simultaneous speech recognition and translation system. Beyond the recognition and translation problems, they had to get the latency down so the system could output words in real time during a live speech. Once that had been achieved, they held a worldwide press conference at Carnegie Mellon demonstrating an automatic lecture translation system: they gave a lecture in English, and it was recognized and translated into Spanish, simultaneously.
Waibel’s focus then shifted to making the systems available to the public. About this time, the iPhone 3GS came out, along with the App Store. Waibel knew the phone had the computing power to put speech recognition and translation into people’s hands.
Waibel and his partners founded a startup, Jibbigo, in 2009 and sold the technology on the App Store. A key advantage was that the app worked without an internet connection, so travelers could use it anywhere without roaming charges. Facebook acquired the company in 2013, and Waibel helped transition the technology. He returned two years later to continue his work on simultaneous translation and to tackle the challenges of larger vocabularies, faster systems and lower latency.
The president of the Karlsruhe Institute of Technology (KIT) in Germany asked Waibel to develop a system to help international students follow lectures in German. Waibel worked on a German version of the lecture translator, which was rolled out as a standard service at the university in 2012. The success and press garnered by the project led other universities to knock on Waibel’s door. Then the secretary general of the European Parliament wanted a system, too. “And after 2013, we began field testing live simultaneous interpretation systems in one of the most demanding environments on the planet,” Waibel said.
The requests kept Waibel’s team busy, but there came a point where he knew they couldn’t do it all from a university lab. Waibel was hesitant to start yet another company (this would be his 11th), so he founded the startup KITES together with a colleague, Sebastian Stuker, whom Waibel funded and advised.
Then came the pandemic.
Zoom became aware of what they’d been working on, and with teaching and meetings now online around the world, the technology was compelling. After a year’s courtship, Zoom acquired KITES in 2021.
“If you now see subtitles and translation on Zoom,” said Waibel, “it was seeded by our KITES team in Karlsruhe.”
Toward Language Transparency
Waibel continues to work as a research fellow at Zoom on learning highly technical terms, names and acronyms, which change constantly.
Another challenge that excites Waibel is multilinguality, or the use of different languages within the same sentence. Switching languages mid-sentence is difficult for a system developed for a single language.
“You have a German lecture in computer science that’s peppered with English words,” said Waibel, “and sometimes people switch to English when they want to say something. People come up with weird ways of speaking, word creations that blend the two languages, but everyone understands what they mean because they know some English. And it’s even considered cool for young people in Germany to use a lot of English words in their German.” A multilingual system solves this problem.
Waibel’s goal all along has been to allow one person to understand another person in another culture with no barriers of any kind. He still strives to let people communicate and interact with one another freely. But fully free communication goes beyond mere words.
“It’s not only voice; it’s multimodal,” he said.
Hand gestures, facial expressions and cultural affectations are part of communication, not simply the words themselves. To address this, Waibel works on including all such multimodal communication cues.
Another focus for Waibel is the automatic dubbing of videos. This technology takes the translation of videos into other languages to the next level by using the speaker’s own voice. AI alters the video so the speaker appears to be speaking the new language.
“We call it ‘Face Dub’ or ‘lip synchronous’ translations,” Waibel said, because the speaker’s mouth moves convincingly along with the translated language. A few obvious applications are movie dubbing (no subtitles to read) as well as global video conferencing.
Waibel remains aware of the threats posed by so-called deep fakes, or the use of these types of technologies to misrepresent the truth and thereby mislead people. He’s working on solutions for safeguarding people from potential threats — a much larger topic for another article.
Waibel believes we have moved beyond the term spoken language translation. He now prefers to call his goal language transparency.
The name — and certainly the technology — may change, but Waibel’s goals of fluid and comfortable communication, without barriers, remain the same. ■