2022 INSTITUTIONAL PARTICIPANTS

Miss Dan Su

PhD Candidate

The Hong Kong University of Science and Technology

Dan Su is a fourth-year PhD candidate in the Department of Electronic & Computer Engineering at the Hong Kong University of Science and Technology (HKUST), supervised by Prof. Pascale Fung. Her current research interests center on question answering and summarization. She has won several top-tier competitions, including the Kaggle COVID-19 Open Research Dataset Challenge (CORD-19) organised by AI2 in 2020 and the Chatbot Millionaire Challenge organised by the HK Science & Technology Park in 2019. The award-winning CAiRE-COVID system her team built has been showcased at the HK Science Museum Exhibition and demonstrated at the HK Virtual InnoCarnival 2020. Her submission topped the KILT leaderboard on open-domain long-form QA, organised by Facebook AI. She has published many papers at top conferences and serves as a reviewer for top natural language processing venues such as ACL, NAACL, EMNLP, and ARR. She received her Master's degree from HKUST and her Bachelor's degree from the University of Science and Technology of China (USTC).

Generative Long-form Question Answering: Fluency, Relevance and Faithfulness

Question answering (QA) aims to build computer systems that can automatically answer questions posed by humans, and has been a long-standing problem in natural language processing (NLP). This thesis investigates the particular problem of generative long-form question answering (LFQA), which aims to generate an in-depth, paragraph-length answer to a given question. Specifically, we first investigate how to build a practical application for real-time open-domain LFQA. We then focus on enhancing answer quality in terms of 1) fluency, i.e., grammatical correctness and low repetition; 2) relevance of the answer to the question; and 3) faithfulness, which measures the factual correctness of the generated answer. We first present a real-time LFQA system that can efficiently generate answers grounded in multiple documents. The system demonstrates its effectiveness at generating fluent and somewhat relevant answers, winning one of the Kaggle competitions related to COVID-19. Next, we propose to incorporate the explicit answer relevance of the source documents into the generation model to enhance the relevance of the generated answer. Finally, we present a new end-to-end framework for generative QA that jointly models answer generation and machine reading to tackle the answer-faithfulness challenge. State-of-the-art results on two LFQA datasets demonstrate the effectiveness of our methods against strong baselines on both automatic and human evaluation metrics. Our method also topped a public leaderboard on the LFQA task.
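The open-domain LFQA setting described above is commonly realized as a retrieve-then-generate pipeline. The following is a minimal illustrative sketch of that pattern, not the author's actual system: document retrieval is reduced to toy term-overlap scoring, and the generator is a stub standing in for a sequence-to-sequence model (e.g. a BART-style generator); all names here are hypothetical.

```python
# Minimal retrieve-then-generate sketch for open-domain LFQA.
# Assumptions: toy term-overlap retrieval; the generator is a stub that a
# real system would replace with a seq2seq model conditioned on the
# question plus the retrieved context.
from collections import Counter

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Score documents by term overlap with the question; return the top-k."""
    q_terms = Counter(question.lower().split())
    def score(doc: str) -> int:
        # Multiset intersection counts shared terms between doc and question.
        return sum((Counter(doc.lower().split()) & q_terms).values())
    return sorted(documents, key=score, reverse=True)[:k]

def generate_answer(question: str, context: list[str]) -> str:
    """Stub generator: builds the concatenated input a seq2seq model
    would consume to write a paragraph-length answer."""
    return f"question: {question} context: {' '.join(context)}"

docs = [
    "Vaccines train the immune system to recognise a pathogen.",
    "The stock market closed higher on Tuesday.",
    "mRNA vaccines deliver instructions for making a viral protein.",
]
top = retrieve("how do mRNA vaccines work", docs)
print(generate_answer("how do mRNA vaccines work", top))
```

The faithfulness and relevance ideas in the abstract would slot into this skeleton: relevance scores from retrieval can be fed into the generator rather than discarded, and a reading component can check the generated answer against the retrieved evidence.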