Distinguished Lecture Series - Talk by Frank Keller (University of Edinburgh)

We are pleased to announce our upcoming Distinguished Lecture Series talk by Frank Keller (University of Edinburgh)! The talk will take place in person on May 7th, in room UN32.101. Professor Keller will also be available for meetings on May 7th. If you are interested in scheduling a meeting, please email .

Frank Keller is a professor in the School of Informatics at the University of Edinburgh. He has held visiting positions at MIT and the University of Washington. His research focuses on natural language processing, particularly language and vision tasks such as image description, video summarization, and visual storytelling. He also develops systems that understand long-form narratives, including books and screenplays, and builds computational models of human language processing.

Prof. Keller co-leads the UKRI Centre for Doctoral Training in Responsible Natural Language Processing, which aims to develop trustworthy and ethical NLP systems. He serves on the editorial board of Transactions of the ACL and is an ELLIS fellow. Previously, he was awarded an ERC grant for his research on language and vision.

Title: Grounding across Modalities and Domains

In order to understand or generate multimodal inputs, AI systems must perform grounding – the process of linking entities or actions across different modalities. For example, objects depicted in images and videos need to be associated with corresponding textual references. However, large language models struggle with grounding, limiting their performance in tasks such as image generation and video understanding.

In this talk, I will present two case studies demonstrating how explicit grounding can enhance multimodal AI. First, I will argue that character grounding is essential for visual storytelling – the task of turning a sequence of images into a coherent narrative. I will introduce a model that generates visually grounded stories by building coreference chains for characters across images and text, leading to stories that are more specific, coherent, and engaging.

The second case study focuses on understanding instructional videos, such as those demonstrating cooking or home improvement tasks. In this domain, entities are often implicit (not mentioned in text) and frequently change (being merged, separated, or transformed), making grounding particularly challenging. I will present models that address this challenge by computing the semantic roles of both explicit and implicit entities and tracking them across instructional steps, even as they undergo transformations. These models enhance procedural understanding, improving AI’s ability to follow and reason about complex tasks.

Date: May 7th, 2025
Time: 9:45
Place: Universitätsstraße 32.101, Campus Vaihingen of the University of Stuttgart.

Looking forward to seeing you all there! No registration necessary.