Computer Science at CMU underpins diverse fields and endeavors in today’s world, all of which link SCS to profound advances in art, culture, nature, the sciences and beyond.

CHARLOTTE HU

Models to Help Linguists Reconstruct Ancient Languages

Tracing languages back to their roots often presents challenges like missing words, scant evidence and countless questions. Researchers in Carnegie Mellon University’s School of Computer Science hope to tackle some of these challenges using computer models that help reconstruct ancient languages.

Even though they’re no longer used, the protolanguages that spawned modern languages can provide a window into the cultural past. They shine light on how people have moved, their relationships and how different groups came into contact in ways that can’t be gleaned from archaeological artifacts or population genetics.

To reconstruct missing or incomplete ancestral languages, David Mortensen, an assistant research professor in the Language Technologies Institute, worked with SCS undergraduate Liang (Leon) Lu and University of Southern California student Peirong Xie to create a semisupervised computer model that can derive protolanguages from their modern forms based on a small set of examples. Their work, “Semisupervised Neural Protolanguage Reconstruction,” received a Best Paper Award at the 2024 Association for Computational Linguistics conference in Bangkok.

“For most language families in the world, their protolanguages have not been reconstructed,” Lu said.

Not all ancestor languages are known. For example, Proto-Germanic, the common ancestor language of German, Swedish and Icelandic, is one of the many missing languages that historical linguists have been trying to reconstruct.

“As far as we know, nobody ever wrote down Proto-Germanic, but it’s the shared ancestor of all these languages,” Mortensen said. “Protolanguage reconstruction is about piecing together those languages.”

Mortensen’s lab first set out to train computer models to reconstruct the final few missing words of a protolanguage. The model relied on modern words and their known ancestors — also called protoforms — to fill in the gaps. It needed a lot of human guidance, and it wasn’t scalable or all that useful to historical linguists.

“Languages have family trees. Latin split into Spanish, French, Portuguese, Italian and Romanian. English comes from Middle English, which comes from Old English, which comes from Proto-West Germanic,” Mortensen said. “You could recover or reconstruct what the ancestor languages were like by comparing the descendant languages and assuming a consistent set of changes.”
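The idea of undoing a consistent set of changes can be sketched in a few lines of Python. The words and rules below are simplified illustrations (loosely inspired by the Latin *pater* / English *father* cognate pair), not the lab's actual method:

```python
def undo_rules(word: str, rules: list[tuple[str, str]]) -> str:
    """Reverse an ordered list of (old, new) sound changes, latest first."""
    for old, new in reversed(rules):
        word = word.replace(new, old)
    return word

# Hypothetical, simplified regular changes from a shared ancestor
# into two descendant languages.
rules_lang_a = [("p", "f"), ("t", "th")]  # a Grimm's-law-style shift
rules_lang_b = [("er", "re")]

# Undoing each language's changes independently converges on the
# same protoform, which is the comparative method's consistency check.
print(undo_rules("father", rules_lang_a))  # pater
print(undo_rules("patre", rules_lang_b))   # pater
```

Real sound changes are conditioned on context and can feed one another, which is why the neural model is needed at scale, but the principle is the same: consistent changes can be run in reverse.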

This semisupervised approach offers a practical path for historical linguists trying to reconstruct a protolanguage from its descendants in a new language family. The model can start working with only a few hundred example translations provided by a human linguist. Then it can predict protoforms for words that the linguist hasn’t worked out yet. This tool has the potential to ease the total workload of the human linguist, although it does need to be trained separately for each language family.
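The shape of that semisupervised setup can be sketched with invented toy words and a deliberately simple character-mapping "model" standing in for the paper's neural network:

```python
# Invented toy data: a small labeled set with linguist-supplied
# protoforms, plus words whose protoforms are not yet worked out.
labeled = [
    ("fana", "*pana"),   # (descendant form, reconstructed protoform)
    ("hilo", "*kilo"),
]
unlabeled = ["fiho"]     # protoform unknown

# Stand-in for training: learn a character-level mapping from
# descendant symbols to proto symbols using the labeled pairs.
mapping = {}
for desc, proto in labeled:
    for d, p in zip(desc, proto.lstrip("*")):
        mapping[d] = p

# Predict protoforms for the words the linguist hasn't reached yet.
predictions = ["*" + "".join(mapping.get(c, c) for c in w) for w in unlabeled]
print(predictions)  # ['*piko']
```

A real system learns far richer, context-sensitive correspondences and also exploits the unlabeled words during training, but the division of labor is the one shown here: a few hundred human-made reconstructions seed the model, and it extrapolates to the rest.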

Learn more about the research and access the models on the project’s website.

More from the Spring 2025 LINK Issue
