Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS).
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Models enabling single-stage training and parallel sampling had been proposed, but their quality fell short of two-stage TTS systems; this work closes that gap.
The two-stage model pipeline
The first stage is to produce intermediate speech representations such as mel-spectrograms (Shen et al., 2018) or linguistic features from the preprocessed text, and the second stage is to generate raw waveforms conditioned on the intermediate representations
Problems with two-stage models
their sequential generative process makes it difficult to fully utilize modern parallel processors.
two-stage pipelines remain problematic because they require sequential training or fine-tuning
Recent end-to-end models (FastSpeech 2s and EATS)
despite potentially improving performance by utilizing the learned representations, their synthesis quality lags behind two-stage systems.
Key features
parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models.
we connect two modules of TTS systems through latent variables to enable efficient end-to-end learning.
To increase expressive power
To improve the expressive power of our method so that high-quality speech waveforms can be synthesized, we apply normalizing flows to our conditional prior distribution and adversarial training on the waveform domain.
The one-to-many problem: a single text can be realized in multiple forms, differing in pitch and duration.
To address the one-to-many problem, we also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text.
With the uncertainty modeling over latent variables and the stochastic duration predictor, our method captures speech variations that cannot be represented by text.
Better than Glow-TTS + HiFi-GAN:
obtains more natural sounding speech and higher sampling efficiency
we aim to provide more high-resolution information for the posterior encoder. We, therefore, use the linear-scale spectrogram of target speech x_lin as input rather than the mel-spectrogram
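A minimal sketch of how such a linear-scale (STFT magnitude) spectrogram can be computed in PyTorch; the FFT and hop sizes here are illustrative assumptions, not the paper's exact values:

```python
import torch

def linear_spectrogram(wav, n_fft=1024, hop_length=256):
    """Linear-scale magnitude spectrogram via STFT (no mel filterbank).

    wav: [B, T] waveform tensor. Returns [B, n_fft // 2 + 1, frames].
    n_fft and hop_length are assumed values for illustration.
    """
    window = torch.hann_window(n_fft, device=wav.device)
    spec = torch.stft(wav, n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    return spec.abs()  # magnitude only; no mel projection, so full resolution
```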
Increasing the expressiveness of the prior distribution is important for generating realistic samples, so a normalizing flow is applied, which allows an invertible transformation from a simple distribution to a more complex one.
We found that increasing the expressiveness of the prior distribution is important for generating realistic samples. We, therefore, apply a normalizing flow f_θ (Rezende & Mohamed, 2015), which allows an invertible transformation of a simple distribution into a more complex distribution following the rule of change-of-variables, on top of the factorized normal prior distribution
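In the paper's notation (with c the condition derived from text), this change-of-variables construction gives the flow-augmented prior:

```latex
p_\theta(z \mid c) = \mathcal{N}\big(f_\theta(z);\, \mu_\theta(c),\, \sigma_\theta(c)\big)\,
\left|\det \frac{\partial f_\theta(z)}{\partial z}\right|
```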
Monotonic Alignment Search is used to estimate the alignment between the input text and the target speech.
To estimate an alignment A between input text and target speech, we adopt Monotonic Alignment Search
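A minimal NumPy sketch of the MAS dynamic program (as introduced in Glow-TTS): each spectrogram frame is assigned to a text token so that the alignment is monotonic and non-skipping while maximizing the total prior log-likelihood. The function name and interface are illustrative:

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Find the monotonic alignment maximizing total log-likelihood.

    log_p: [T_text, T_mel] matrix, log_p[j, i] = log-likelihood of
    frame i under the prior of text token j.
    Returns a binary alignment matrix A of the same shape.
    """
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for i in range(1, T_mel):
        for j in range(T_text):
            stay = Q[j, i - 1]                              # keep the same token
            move = Q[j - 1, i - 1] if j > 0 else -np.inf    # advance one token
            Q[j, i] = log_p[j, i] + max(stay, move)
    # Backtrack: the last frame must align with the last token.
    A = np.zeros_like(log_p)
    j = T_text - 1
    for i in range(T_mel - 1, -1, -1):
        A[j, i] = 1.0
        if i > 0 and j > 0 and Q[j - 1, i - 1] > Q[j, i - 1]:
            j -= 1
    return A
```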
To generate human-like speech, a stochastic duration predictor is designed so that its samples follow the duration distribution of the given phonemes.
To generate human-like rhythms of speech, we design a stochastic duration predictor so that its samples follow the duration distribution of given phonemes.
The stochastic duration predictor is a flow-based model.
The stochastic duration predictor is a flow-based generative model
Each phoneme duration is 1) a discrete integer, which must be dequantized to use continuous normalizing flows, and 2) a scalar, which prevents high-dimensional transformation due to invertibility; variational dequantization and variational data augmentation are applied to solve these problems.
The direct application of maximum likelihood estimation, however, is difficult because the duration of each input phoneme is 1) a discrete integer, which needs to be dequantized for using continuous normalizing flows, and 2) a scalar, which prevents high-dimensional transformation due to invertibility. We apply variational dequantization (Ho et al., 2019) and variational data augmentation (Chen et al., 2020) to solve these problems
The stop gradient operator is applied to block back-propagation, so that training the duration predictor does not affect the other modules.
We apply the stop gradient operator, which prevents back-propagating the gradient of inputs, to the input conditions so that the training of the duration predictor does not affect that of other modules.
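In PyTorch, detach() plays the role of the stop gradient operator; a hedged sketch (the variable and loss names are hypothetical):

```python
# h_text: hidden text representation feeding the duration predictor.
# detach() = stop gradient: duration-predictor gradients never reach
# the text encoder that produced h_text.
cond = h_text.detach()
dur_loss = duration_predictor.nll(durations, cond)  # hypothetical loss call
```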
The sampling procedure is relatively simple: phoneme durations are sampled from random noise through the inverse transformation of the stochastic duration predictor and then converted to integers.
The sampling procedure is relatively simple; the phoneme duration is sampled from random noise through the inverse transformation of the stochastic duration predictor, and then it is converted to integers.
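A minimal sketch of this inference path, assuming a flow object with a hypothetical inverse(noise, cond) method and log-domain durations; the noise_scale, the 2-channel layout, and all names are illustrative assumptions:

```python
import torch

def sample_durations(flow, text_hidden, noise_scale=0.8):
    """Sketch: sample integer phoneme durations from random noise.

    text_hidden: [B, H, T_text] hidden text representation used as the
    condition; no gradients are needed at inference time.
    """
    B, _, T = text_hidden.shape
    # 2 channels: duration plus the variational-augmentation channel (assumed).
    noise = torch.randn(B, 2, T, device=text_hidden.device) * noise_scale
    log_dur = flow.inverse(noise, cond=text_hidden)[:, 0]  # keep duration channel
    # Durations are modeled in the log domain; exponentiate, then round up.
    return torch.ceil(torch.exp(log_dur)).long()
```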
Model architecture
posterior encoder, prior encoder, decoder, discriminator, and stochastic duration predictor
The posterior encoder and discriminator are only used for training, not for inference.
The normalizing flow in the prior encoder is a stack of four affine coupling layers, each composed of four WaveNet residual blocks.
The posterior encoder consists of 16 WaveNet residual blocks; it takes linear-scale log-magnitude spectrograms and produces latent variables with 192 channels.
The decoder's input is the latent variables produced by the prior and posterior encoders, so its input channel size is 192.
For the decoder's last convolutional layer, the bias parameter is removed because it causes unstable gradient scales during mixed-precision training.
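A hedged sketch of such a bias-free final projection (the channel and kernel sizes are assumptions for illustration):

```python
import torch.nn as nn

# Final 1-D conv projecting decoder features to a single waveform channel.
# bias=False drops the bias term, avoiding unstable gradient scales under
# mixed-precision training (channel/kernel sizes are illustrative).
final_conv = nn.Conv1d(32, 1, kernel_size=7, padding=3, bias=False)
```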
For the discriminator, HiFi-GAN uses a multi-period discriminator containing five sub-discriminators (with periods 2, 3, 5, 7, 11) and a multi-scale discriminator containing three sub-discriminators.
To improve training efficiency, we keep the first sub-discriminator of the multi-scale discriminator, which operates on raw waveforms, and discard the two sub-discriminators that operate on average-pooled waveforms.
The resulting discriminator can then be viewed as a multi-period discriminator with periods (1, 2, 3, 5, 7, 11).
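A minimal sketch of this combined discriminator, assuming a simplified PeriodDiscriminator (real HiFi-GAN sub-discriminators stack several strided 2-D convolutions; the single conv here is illustrative). Period 1 degenerates to operating on the raw waveform, standing in for the kept multi-scale sub-discriminator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminator(nn.Module):
    """Sketch: reshape the 1-D waveform into a 2-D [frames x period] grid
    before applying 2-D convolutions (single conv for illustration)."""
    def __init__(self, period):
        super().__init__()
        self.period = period
        self.conv = nn.Conv2d(1, 32, (5, 1), (3, 1), padding=(2, 0))

    def forward(self, x):              # x: [B, 1, T]
        b, c, t = x.shape
        pad = (-t) % self.period       # right-pad so T divides by the period
        x = F.pad(x, (0, pad))
        x = x.view(b, c, -1, self.period)
        return self.conv(x)

# Period 1 plays the role of the kept raw-waveform MSD sub-discriminator.
discriminators = nn.ModuleList(
    PeriodDiscriminator(p) for p in [1, 2, 3, 5, 7, 11]
)
```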
Within the stochastic duration predictor, the posterior encoder and normalizing flow modules have four coupling layers of neural spline flows. Each coupling layer first processes the input and input conditions through a DDSConv block and produces 29-channel parameters that are used to construct 10 rational-quadratic functions. We set the hidden dimension of all coupling layers and condition encoders to 192.
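The 29-channel figure follows from the rational-quadratic spline parameterization of neural spline flows (Durkan et al., 2019): with K bins, each transformed dimension needs K bin widths, K bin heights, and K - 1 interior knot derivatives:

```latex
K + K + (K - 1) = 3K - 1 = 29 \quad \text{for } K = 10 .
```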