Multimodal Large Language Modeling

As impressive as chatbots like OpenAI’s ChatGPT and Google’s Bard are, one feature they lack is multimodal integration of text and images for both input and output.

ADAM KOHLHAAS

Researchers in Carnegie Mellon University’s Machine Learning Department (MLD) and Language Technologies Institute (LTI) have developed a multimodal large language model (LLM) named Generating Images with Large Language Models (GILL). GILL is one of the first models that can process and produce interleaved images and text, accepting both modalities as input and generating both as output.

GILL accepts both images and text as input and determines the best modality in which to respond. Along with plain text responses, it can generate images when a more creative answer is needed or an existing image is not available. It can also retrieve images from an archive in situations requiring a factual response. This flexibility allows the model to seamlessly interleave relevant images with its text, producing combined image-and-text responses that may be more illustrative than text alone.
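To make the retrieval-versus-generation behavior concrete, here is a minimal sketch of one way such a choice could be implemented. GILL itself learns when to retrieve and when to generate; the similarity threshold, embedding conventions, and function names below are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: decide between returning an archived image and
# synthesizing a new one. All names and the threshold heuristic are assumptions.
import numpy as np

def respond_with_image(query_embedding, archive_embeddings, archive_images,
                       generate_fn, retrieval_threshold=0.8):
    """Return an archived image if a close match exists; otherwise generate one.

    query_embedding:    unit-normalized vector for the model's image request
    archive_embeddings: (N, d) unit-normalized embeddings of archived images
    archive_images:     list of N archived images aligned with archive_embeddings
    generate_fn:        callable that synthesizes a new image from the embedding
    """
    similarities = archive_embeddings @ query_embedding  # cosine similarity on unit vectors
    best = int(np.argmax(similarities))
    if similarities[best] >= retrieval_threshold:
        return archive_images[best]        # factual case: reuse a real image
    return generate_fn(query_embedding)    # creative/novel case: synthesize an image
```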

“I’m excited about GILL because it is one of the first models that can process image-text inputs to generate text interleaved with retrieved and generated images,” said Jing Yu Koh, an MLD Ph.D. student and one of GILL’s co-authors. “It is more general than previous multimodal language models and has the potential for a wide set of applications.”

To achieve this unique combination of abilities, CMU researchers proposed an efficient mapping network that grounds the output space of a frozen text-only LLM in the input space of a frozen text-to-image generation model. This mapping lets the LLM be trained efficiently to produce vector embeddings compatible with those the generation model expects. GILL exhibits a wider range of capabilities than prior multimodal language models (such as the ability to generate novel images) and outperforms non-LLM-based generation models across several text-to-image tasks that measure context dependence.
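The sketch below illustrates the general idea of such a mapping network: a small trainable module that projects hidden states from a frozen LLM into the conditioning-embedding space of a frozen text-to-image generator. This is not the authors' code; the dimensions, the transformer-decoder design, and the plain MSE training objective are assumptions chosen for illustration.

```python
# Minimal sketch (assumed details, not GILL's implementation) of a mapping
# network between a frozen LLM and a frozen text-to-image generator.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps LLM hidden states for learned image tokens into the generator's
    text-conditioning space (e.g., a fixed number of 768-dim embeddings)."""

    def __init__(self, llm_dim=4096, gen_dim=768, num_query_tokens=77):
        super().__init__()
        # Learned query vectors attend over the projected LLM states.
        self.queries = nn.Parameter(torch.randn(num_query_tokens, gen_dim))
        self.proj_in = nn.Linear(llm_dim, gen_dim)
        layer = nn.TransformerDecoderLayer(d_model=gen_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, img_token_hidden):            # (batch, n_img_tokens, llm_dim)
        memory = self.proj_in(img_token_hidden)     # (batch, n_img_tokens, gen_dim)
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.decoder(queries, memory)        # (batch, num_query_tokens, gen_dim)

def training_step(mapper, llm_hidden_for_img_tokens, target_generator_embeds, optimizer):
    """One training step: only the mapping network receives gradients;
    the LLM and the image generator both stay frozen."""
    pred = mapper(llm_hidden_for_img_tokens)
    loss = nn.functional.mse_loss(pred, target_generator_embeds)  # align embedding spaces
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the small mapping module is trained while both large pretrained models stay frozen, this style of approach is far cheaper than end-to-end multimodal training, which is consistent with the modular, model-agnostic framing the researchers describe below.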

The CMU members involved in this research include Koh; Daniel Fried, an assistant professor in the LTI; and Ruslan Salakhutdinov, a professor in MLD. They’re excited about their method’s potential for future applications.

“GILL is modular, and our approach is model agnostic, meaning that it will likely benefit from applying stronger LLMs and visual models released in the future,” Koh said.
