2024 SHORTLISTED PARTICIPANTS

Dr. Boyi Li

Postdoc / Research Scientist

UC Berkeley / NVIDIA

Research Vision: Intelligent Systems that Perceive, Imagine, and Act like Humans by Learning from Multimodal Data

My research objective is to build efficient intelligent systems that learn from perception of the physical world and from interaction with humans in order to execute diverse, complex tasks that assist people. By learning from multimodal data, these systems should support seamless interaction with humans and computers in both digital software environments and tangible real-world settings.

I envision creating intelligent systems that can perceive, imagine, and act in a manner akin to human beings by learning from multimodal data across a variety of tasks. Training a vision-only system to achieve top-tier performance across varied tasks, however, is challenging. Why? First, earlier methods require extensive task-specific data, which is impractical and expensive to gather. Second, many tasks involve complex environmental interactions that are not fully represented by visual data alone. Yet the importance of developing vision systems integrated with additional modalities, such as language, was undervalued in previous years: traditional approaches rely on extensive, task-specific visual data and manual annotations, and their rigid category definitions limit adaptation to new concepts, while a wealth of text-visual data remains largely unexplored.

Nearly three years ago, I recognized the critical importance of learning from multimodal data and began research in this area. Drawing on the groundwork laid by my previous work in image and video processing, open-vocabulary object recognition, accurate scene understanding, text-aligned content creation, human animation with 3D control, and interactive task planning, I have explored relatively uncharted text-visual and other multimodal data. My commitment is to lead a thorough research initiative that pushes the limits from algorithmic development to practical application, covering recognition, scene understanding and reasoning, content generation, and robotics.

Since 2022, models such as DALL·E 3 and GPT-4 have demonstrated impressive results in text-to-image generation and vision-language comprehension. Training models on internet-scale vision-language data can significantly enhance generalization and facilitate the emergence of semantic reasoning. This underscores the need for multimodal learning that efficiently utilizes internet data and improves a model's ability to understand varied contexts.
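As a rough illustration of what "learning from multimodal data" can mean in practice, the sketch below shows a minimal CLIP-style contrastive objective that aligns image and text embeddings in a shared space. The module names, feature dimensions, and placeholder encoders are assumptions made for this sketch only; they do not describe the specific models referenced above.

    # Minimal sketch (assumed setup, not the author's actual models): a CLIP-style
    # contrastive loss that pulls matched image-caption pairs together in a shared
    # embedding space and pushes mismatched pairs apart.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyVisionLanguageAligner(nn.Module):
        def __init__(self, img_dim=512, txt_dim=512, embed_dim=256):
            super().__init__()
            # Stand-ins for real image / text encoders (e.g. a ViT and a text transformer).
            self.img_proj = nn.Linear(img_dim, embed_dim)
            self.txt_proj = nn.Linear(txt_dim, embed_dim)
            self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1 / 0.07)

        def forward(self, img_feats, txt_feats):
            # Project both modalities into a shared space and L2-normalize.
            img = F.normalize(self.img_proj(img_feats), dim=-1)
            txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
            # Cosine-similarity logits between every image and every caption in the batch.
            logits = self.logit_scale.exp() * img @ txt.t()
            # Matched (image, caption) pairs lie on the diagonal.
            targets = torch.arange(img.size(0), device=img.device)
            # Symmetric cross-entropy over rows (image -> text) and columns (text -> image).
            return (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.t(), targets)) / 2

    # Usage with random features standing in for encoder outputs.
    model = ToyVisionLanguageAligner()
    loss = model(torch.randn(8, 512), torch.randn(8, 512))
    loss.backward()

Trained at internet scale on paired image-text data, this kind of objective is one simple way such models acquire the open-vocabulary generalization discussed above.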