Time: 10:00 AM, Tuesday, January 13, 2026
Venue: Room B618, 于刚·宋晓楼 (Yu Gang · Song Xiao Building), School of Electronic Information, Wuhan University
Title: High Quality Audio Generation with Latent Diffusion Models
Speaker: Haohe Liu, Research Scientist at Meta, Redmond, WA, USA
Host: Prof. Gongping Huang (黄公平)
Abstract:
High-quality audio generation remains a core challenge in artificial intelligence, with applications in content creation, virtual environments, and assistive technologies. To address this challenge effectively, intuitive user interaction is important, and our preliminary user study on AI-assisted sound-search systems suggests that natural language provides a flexible and user-friendly means for interacting with audio. Inspired by this insight, we focus on text-to-audio generation and explore how diffusion-based models can improve the diversity, quality, and efficiency of audio generation. This thesis introduces AudioLDM and its successor AudioLDM 2 for text-to-audio generation, followed by AudioSR for audio super-resolution, and SemantiCodec to enable efficient, semantically-aware audio compression.
Previous work on text-to-audio generation is often limited in the scope of the audio it can generate: such models may not follow the text prompt well, and the audio they produce is often restricted to a narrow set of categories. To address these challenges, we propose AudioLDM, a text-to-audio model that accepts open-domain natural language as input and generates audio that follows the given descriptions. To further explore the possibility of combining the advantages of diffusion modelling and language modelling, we introduce AudioLDM 2, which enables the generation of speech, music, and environmental sound effects from text within a shared architecture, with improved generation quality over AudioLDM.
To further enhance the output quality of the audio generation model, we propose AudioSR, a latent diffusion-based model specialized in audio super-resolution. Unlike previous methods that focus on narrow domains such as speech, AudioSR is designed to enhance a broad spectrum of sounds, including speech, music, and general sound effects, and uniquely supports input audio with flexible sampling rates, making AudioSR robust for diverse real-world use cases. Our experiments show that AudioSR can effectively restore missing high-frequency details with significantly improved perceptual quality, as verified on both real audio recordings and generative model outputs.
Finally, we explore compact and semantically rich latent spaces for future audio generation models. Existing language-modelling-based audio generation models often rely on high-bitrate codecs, leading to high computational costs, limited semantic abstraction in the audio tokens, and inefficient downstream model training. To address these challenges, we propose SemantiCodec, an ultra-low bitrate codec that encodes audio into compressed but semantically rich representations. Building upon the self-supervised learning feature, SemantiCodec achieves a good tradeoff between compression efficiency and reconstruction quality, outperforming existing neural codecs at significantly lower bitrates.
Speaker Biography:

Haohe Liu is currently a Research Scientist at Meta in Redmond, WA, USA. He received his PhD from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, supervised by Prof. Mark D. Plumbley and Prof. Wenwu Wang. During his PhD, Haohe conducted research on the understanding and recreation of music, sound effects, and speech. He is the primary author of several influential projects, including AudioLDM, AudioLDM 2, NaturalSpeech, and AudioSR, with his open-source work earning over 10,000 GitHub stars and over 4,000 Google Scholar citations.
All interested faculty members and students are welcome to attend!


Address: No. 299 Bayi Road, Wuchang District, Wuhan, Hubei, P.R. China (430072)
Tel: (+86) 27-68756275 / 68778537
Fax: (+86) 27-68778537
Web: http://eis.whu.edu.cn
Email: eisyb@whu.edu.cn
School of Electronic Information, Wuhan University
Official WeChat Account