Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite their advantages, these parallel TTS models cannot be trained without guidance from autoregressive TTS models acting as external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. We introduce Monotonic Alignment Search (MAS), an internal alignment search algorithm for training Glow-TTS. By leveraging the properties of flows, MAS searches for the most probable monotonic alignment between text and the latent representation of speech. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive TTS model Tacotron 2 at synthesis time with comparable speech quality, requiring only 1.5 seconds to synthesize one minute of speech end-to-end. We further show that our model can be easily extended to a multi-speaker setting. Our demo page and code are publicly available.
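To make the idea of searching for the most probable monotonic alignment more concrete, here is a minimal sketch of a Viterbi-style dynamic program. The abstract does not state the recurrence, so this is an illustration under assumptions rather than the authors' implementation: it assumes a precomputed matrix `log_lik[i][j]` holding the log-likelihood of latent frame `j` under the prior of text token `i`, that every frame is assigned to exactly one token, that no token is skipped, and that the alignment starts at the first token and ends at the last. The function name `monotonic_alignment_search` is hypothetical.

```python
import math


def monotonic_alignment_search(log_lik):
    """Return, for each latent frame, the index of the text token it aligns to.

    log_lik[i][j] is assumed to be the log-likelihood of frame j under
    token i's prior distribution (an assumption of this sketch).
    """
    n_text = len(log_lik)
    n_frames = len(log_lik[0])
    if n_text > n_frames:
        raise ValueError("need at least one frame per text token")

    # q[i][j]: best total log-likelihood of aligning frames 0..j
    # to tokens 0..i, with frame j assigned to token i.
    q = [[-math.inf] * n_frames for _ in range(n_text)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        # Token i cannot be reached before frame i, so limit i to <= j.
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]                       # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else -math.inf  # move to next token
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return path


# Toy example: 3 tokens, 5 frames, with one clearly preferred alignment.
preferred = [0, 0, 1, 2, 2]
toy = [[0.0 if preferred[j] == i else -10.0 for j in range(5)] for i in range(3)]
print(monotonic_alignment_search(toy))
```

The table has one cell per (token, frame) pair, so the search costs time proportional to the number of tokens times the number of frames. The returned path never moves backwards and never skips a token, which is what makes the alignment monotonic in this sketch.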