Speech Inpainting: Context-based Speech Synthesis Guided by Video

Paper
Code + Weights
Demos
Paper under review



Abstract

Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with the production of speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. In this paper, we present a transformer-based deep learning model which produces state-of-the-art results in audio-visual speech inpainting. Given an audio-visual signal whose audio stream is partially corrupted, audio-visual speech inpainting is the task of synthesizing the audio of the corrupted segment coherently with the corresponding video and the uncorrupted audio. We compare the performance of our model against the previous state-of-the-art model and audio-only baselines, showing the importance of having an additional cue that provides information about the content of the corrupted audio. We also show that the visual features from AV-HuBERT [1] are well suited to speech synthesis.
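
The sketch below illustrates the inpainting setup described in the abstract: a span of the audio stream is masked (corrupted), and a model conditioned on the surrounding audio and the aligned video would synthesize speech for that span. The `SpeechInpainter` interface, sample rate, frame rate, and crop size are assumptions for illustration only, not the released API.

```python
# Minimal sketch of the audio-visual speech inpainting setup (assumed values).
import numpy as np

SAMPLE_RATE = 16_000   # assumed audio sample rate (Hz)
VIDEO_FPS = 25         # assumed video frame rate

def mask_segment(waveform: np.ndarray, start_s: float, end_s: float):
    """Zero out the corrupted audio span and return (masked_audio, mask)."""
    mask = np.zeros_like(waveform, dtype=bool)
    lo, hi = int(start_s * SAMPLE_RATE), int(end_s * SAMPLE_RATE)
    mask[lo:hi] = True
    masked = waveform.copy()
    masked[mask] = 0.0
    return masked, mask

# Stand-in inputs: 3 s of audio and the corresponding mouth-region video frames.
audio = np.random.randn(3 * SAMPLE_RATE).astype(np.float32)
video = np.random.randn(3 * VIDEO_FPS, 88, 88).astype(np.float32)

masked_audio, mask = mask_segment(audio, start_s=1.0, end_s=1.6)

# A model conditioned on the uncorrupted audio and the video would then
# synthesize speech for the masked span, e.g. (hypothetical interface):
# inpainted = SpeechInpainter.from_pretrained("...").inpaint(masked_audio, video, mask)
```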