Learning to Synthesize Images with Multimodal and Hierarchical Inputs

Speaker: Yu Zeng

Date: Mar 25, 11:45am–12:45pm

Abstract: In recent years, the field of image synthesis and manipulation has experienced remarkable advancements driven by the success of deep learning methods and the availability of Web-scale datasets. Despite this progress, most current approaches generate images from simplistic inputs such as text and label maps. While these methods have demonstrated an impressive capability in generating realistic images, there persists a notable disconnect between the intricate nature of human ideas and the simplistic input structures employed by the existing models. In this talk, I will present our research towards a more natural way for controllable image synthesis inspired by the coarse-to-fine workflow of human artists and the inherently multimodal aspect of human thought processes. We consider inputs from the semantic and visual modalities, as well as the varying levels of hierarchy under each modality. For the semantic modality, we introduce a general framework for modeling semantic inputs of different levels, which includes image-level text prompts and pixel-level label maps as two extremes and brings a series of mid-level regional descriptions with different precision. For the visual modality, we explore the use of low-level and high-level visual inputs, aligning with the natural hierarchy of visual processing. Specifically, we model the low-level process as an image manipulation problem, characterized by the pixel-wise alignment between the output and input, while the high-level process is cast as reference-based generation, where the goal is to transfer information from reference images to the generated images at the granularity of objects or concepts. Additionally, as the misuse of generated images becomes a societal threat, I will introduce our findings on the trustworthiness of deep generative models in the second part of this talk. After that, I will discuss some potential future research directions.

Biographical Sketch: I am a PhD candidate at Johns Hopkins University, advised by Prof. Vishal M Patel. My research interests lie in computer vision and deep learning. I have focused on two main areas: (1) deep generative models for image synthesis and editing, and (2) label-efficient deep learning. By combining these research areas, I aim to bridge human creativity and machine intelligence through user-friendly and socially responsible models while minimizing the need for intensive human supervision.

Location and Zoom link: 307 Love, or https://fsu.zoom.us/j/98822036003