
Adapting TTS models For New Speakers using Transfer Learning

내공얌냠 2024. 12. 4. 20:50

Interspeech 2022

https://arxiv.org/abs/2110.05798

Contribution

  • present transfer learning methods and guidelines for finetuning single-speaker TTS models for a new voice
  • evaluate and provide a detailed analysis with varying amounts of data
  • demonstrate that transfer learning can substantially reduce the training time and amount of data needed for synthesizing a new voice
  • release an open-source framework and provide a demo

Background and Related Work

  • decompose the waveform synthesis pipeline into two steps:
    1. synthesizing mel spectrograms from language
    2. vocoding the synthesized spectrogram to audible waveforms
    • multi-speaker models perform spectrogram synthesis with an additional speaker-embedding condition
  • To synthesize the voice of a target speaker, past works have investigated techniques like voice conversion and voice cloning
    • voice conversion
      • modify an utterance from a source speaker to make it sound like a target speaker
      • a common approach is dynamic frequency warping that aligns the spectra of different speakers
      • recently, spectral conversion is done with encoder-decoder neural networks, typically trained on speech pairs of the source and target speakers
    • voice cloning
      • to synthesize speech for a new speaker (with limited speech data) for any unseen text
      • recent works on voice cloning have leveraged transfer learning
      • a multi-speaker TTS model is conditioned on a speaker encoder; the speaker encoder is trained independently for the task of speaker verification on a large speaker-diverse dataset, then plugged into the spectrogram synthesizer as a frozen or trainable component
      • the spectrogram synthesizer can be conditioned directly on the speech samples of the new speaker to perform zero-shot voice cloning, or it can be finetuned on the text and speech pairs of the new speaker
    • limitation of above techniques
      • require a large speaker-diverse dataset during training
      • such large-scale multi-speaker datasets are usually noisy, which makes the results unsuitable for user-facing applications

Methodology

  • Spectrogram synthesizer
    • Use FastPitch and add a learnable alignment module, so ground-truth durations are not required
    • FastPitch
      • composed of two feed-forward Transformer (FFTr) stacks
        1. input tokens of phonemes : $x = (x_1, \dots, x_n)$
        2. first FFTr outputs a hidden representation $h = FFTr(x)$, used to predict the duration and average pitch of every token
        3. $\hat{d} = DurationPredictor(h), \quad \hat{p} = PitchPredictor(h)$
        4. pitch is projected to match the dimensionality of the $h \in R^{n\times d}$ and added to $h$. $g = h + PitchEmbedding(p)$
        5. the resulting sum $g$ is directly upsampled and passed to the second FFTr, which produces the output mel-spectrogram sequence $\hat{y} = FFTr([g_1, \dots, g_1, \dots, g_n, \dots, g_n])$.
      • duration prediction module : use learnable alignment-module and loss ($L_{align}$)
      • pitch-prediction module : use ground truth $p$, derived using PYIN, averaged over the input tokens using $\hat{d}$
      • mean-squared error between the predicted and ground-truth modalities plus the forward-sum alignment loss $L_{align}$ : $L = ||\hat{y}-y||^2_2 + \alpha||\hat{p}-p||^2_2 + \beta||\hat{d}-d||^2_2 + \gamma L_{align}$ (a toy sketch of this forward pass and loss appears after this list)
      • during training : end-to-end on text and speech pairs, $y, p$ are computed in the data-loading pipeline
      • during inference : use the predicted $\hat{p}, \hat{d}$ to synthesize speech directly from text
  • Vocoder
    • HiFi-GAN architecture; the HiFi-GAN paper reports that it can perform mel-spectrogram inversion for unseen speakers, but the authors finetune it to improve audio quality
    • upsamples mel-spectrograms to audio waveforms with transposed convolutions
    • two discriminator networks
      • the multi-period discriminator consists of small sub-discriminators, each of which looks only at specific periodic parts of the raw waveform (a toy reshaping sketch follows this list)
      • the multi-scale discriminator consists of small sub-discriminators that judge audio at different scales and learn to capture consecutive patterns and long-term dependencies of the waveform
  • Finetuning Methods
    • Direct Finetuning
      • finetune all the parameters of the pre-trained TTS models directly on the data of the new speaker
      • spectrogram-synthesis model ← text and speech pairs of the new speaker
      • vocoder ← only requires the speech samples of the speaker
      • use mini-batch gradient descent with the Adam optimizer and a fixed learning rate (a minimal finetuning loop is sketched after this list)
    • Mixed Finetuning
      • Direct finetuning can result in overfitting or catastrophic forgetting when the amount of training data of the new speaker is very limited → mix the original speaker's data with the new speaker's data during finetuning
      • assume that we have enough training samples of the original speaker while the number of samples of the new speaker is limited
      • create a data-loading pipeline that samples an equal number of examples from the original and new speaker in each mini-batch
      • Since FastPitch has no speaker embedding layer, one is added (there are now two speakers). During training, look up the speaker embedding from the speaker id of each training sample and add it to the text embedding at each time-step before feeding the input to the first FFTr of FastPitch: $h = FFTr(x + Repeat(speakerEmb))$ (a toy version is sketched after this list)
      • FastPitch model parameters ← from the pre-trained model, but the speaker embedding layer ← randomly initialized and trained along with the other parameters of the model
      • the vocoder needs no separate speaker conditioning, since the spectrogram representation already contains speaker-specific attributes, and is simply finetuned on both speakers
      • the vocoder also uses mini-batches with balanced data from the two speakers during finetuning
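
As a toy illustration of the FastPitch data flow and loss described above, here is a minimal PyTorch sketch. It assumes batch size 1 and uses generic Transformer encoders and linear layers as stand-ins for the real FFTr stacks and predictors; module sizes and loss weights are placeholders, not the paper's (NeMo) implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFastPitch(nn.Module):
    """Minimal stand-in for the FastPitch data flow described above (batch size 1)."""
    def __init__(self, vocab_size=100, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Placeholder "FFTr" stacks; real FastPitch uses feed-forward Transformer blocks.
        self.fftr1 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True), num_layers=2)
        self.fftr2 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True), num_layers=2)
        self.duration_predictor = nn.Linear(d_model, 1)
        self.pitch_predictor = nn.Linear(d_model, 1)
        self.pitch_embedding = nn.Linear(1, d_model)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, tokens, pitch, durations):
        # tokens: (1, n) ints, pitch: (1, n) ground-truth average pitch, durations: (n,) ints
        h = self.fftr1(self.embed(tokens))                  # h = FFTr(x)
        d_hat = self.duration_predictor(h).squeeze(-1)      # predicted per-token durations
        p_hat = self.pitch_predictor(h).squeeze(-1)         # predicted per-token average pitch
        g = h + self.pitch_embedding(pitch.unsqueeze(-1))   # g = h + PitchEmbedding(p)
        g_up = torch.repeat_interleave(g, durations, dim=1) # upsample each token by its duration
        y_hat = self.to_mel(self.fftr2(g_up))               # output mel-spectrogram frames
        return y_hat, p_hat, d_hat

# L = ||y_hat - y||^2 + alpha*||p_hat - p||^2 + beta*||d_hat - d||^2 + gamma*L_align
def fastpitch_loss(y_hat, y, p_hat, p, d_hat, d, l_align, alpha=1.0, beta=1.0, gamma=1.0):
    # alpha/beta/gamma are placeholder weights, not the values used in the paper.
    return (F.mse_loss(y_hat, y) + alpha * F.mse_loss(p_hat, p)
            + beta * F.mse_loss(d_hat, d.float()) + gamma * l_align)
```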
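
The multi-period discriminator idea of HiFi-GAN can also be sketched: the 1-D waveform is reshaped into a 2-D grid whose width equals the period, so each sub-discriminator only sees every p-th sample. The conv stack below is deliberately tiny and is not HiFi-GAN's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPeriodDiscriminator(nn.Module):
    """Simplified sub-discriminator that looks at every `period`-th sample of the waveform."""
    def __init__(self, period):
        super().__init__()
        self.period = period
        # Placeholder conv stack; real HiFi-GAN uses larger strided 2-D convs with weight norm.
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(32, 1, kernel_size=(3, 1), padding=(1, 0)),
        )

    def forward(self, wav):                       # wav: (batch, 1, T)
        b, c, t = wav.shape
        if t % self.period:                       # pad so T is divisible by the period
            pad = self.period - t % self.period
            wav = F.pad(wav, (0, pad), mode="reflect")
            t = wav.shape[-1]
        # Reshape 1-D audio into (T/period, period) so 2-D convs see the periodic structure.
        wav = wav.view(b, c, t // self.period, self.period)
        return self.convs(wav)

# The multi-period discriminator is a collection of these with different prime periods.
periods = [2, 3, 5, 7, 11]
mpd = nn.ModuleList(ToyPeriodDiscriminator(p) for p in periods)
scores = [d(torch.randn(1, 1, 8192)) for d in mpd]
```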
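
A minimal sketch of direct finetuning, assuming a generic pre-trained `model`, its training `loss_fn`, and a `new_speaker_loader` over the new speaker's data (all placeholder names):

```python
import torch

def direct_finetune(model, loss_fn, new_speaker_loader, steps=1000, lr=2e-4):
    """Continue training a pre-trained TTS model on the new speaker's data only.
    `model`, `loss_fn`, and `new_speaker_loader` are placeholders for any
    synthesizer (or vocoder), its training loss, and an iterable of batches."""
    model.train()
    # All parameters of the pre-trained model are updated; nothing is frozen.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # fixed learning rate
    step = 0
    while step < steps:
        for inputs, targets in new_speaker_loader:  # small datasets are simply looped over
            loss = loss_fn(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= steps:
                break
```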
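
For mixed finetuning, the two ingredients are balanced mini-batches from the two speakers and a new speaker-embedding lookup added to the token embeddings before the first FFTr. The sketch below shows both with toy components; the dataset wrapper and encoder are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset

class BalancedPairDataset(Dataset):
    """Toy stand-in for the balanced data-loading pipeline: each item pairs one
    sample from the original speaker with one from the new speaker, so every
    mini-batch contains an equal number of examples from both."""
    def __init__(self, original_samples, new_samples):
        self.original, self.new = original_samples, new_samples

    def __len__(self):
        return max(len(self.original), len(self.new))

    def __getitem__(self, i):
        return self.original[i % len(self.original)], self.new[i % len(self.new)]

class SpeakerConditionedEncoder(nn.Module):
    """Adds a (randomly initialized) speaker embedding to every token embedding
    before the first FFTr, i.e. h = FFTr(x + Repeat(speakerEmb))."""
    def __init__(self, vocab_size=100, d_model=256, n_speakers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)  # new layer, trained from scratch
        self.fftr1 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True), num_layers=2)

    def forward(self, tokens, speaker_id):
        # tokens: (B, n) int token ids, speaker_id: (B,) int speaker ids
        x = self.token_emb(tokens)                       # (B, n, d)
        spk = self.speaker_emb(speaker_id).unsqueeze(1)  # (B, 1, d), repeated over time via broadcast
        return self.fftr1(x + spk)                       # h = FFTr(x + Repeat(speakerEmb))

# e.g. wrap with torch.utils.data.DataLoader(BalancedPairDataset(orig, new), batch_size=8)
```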

Experiments

  • Dataset
    • Hi-Fi TTS dataset, 292 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz
    • keep aside 50 text and speech pairs from each speaker as validation samples
    • train a single-speaker TTS model on one female speaker, then run finetuning experiments on one new male speaker and one new female speaker
    • 4 training subsets for each finetuning speaker : 1 min, 5 min, 30 min, 60 min
    • for mixed finetuning, mix the new speaker's data with 5000 samples (~ 5 hours) from the pre-training speaker
  • Metrics
    • naturalness
      • MOS
      • mixed finetuning > direct finetuning
    • voice similarity to the target speaker
      • visualize the speaker embeddings of real and synth data by reducing the 256 dim utterance embeddings to 2 dimensions using t-SNE (figure 1)
      • EER (Equal Error Rate) on synthetic data
      • create positive and negative pairs and compute the EER (a toy EER computation is sketched after this list)
      • the EER on synthetic data is close to the EER on real data, matching the observations from the t-SNE plots
      • closely mimic the timbre of the target speaker
    • speaking style similarity to the target speaker
      • error metrics for the pitch (fundamental frequency) contours (toy implementations are sketched after this list)
      • GPE (Gross Pitch Error)
      • VDE (Voicing Decision Error)
      • FFE (F0 Frame Error)
      • increasing the amount of training data reduces the difference between the speaking rate (phonemes per second, computed from the text and audio length without forced alignment) of the actual data and that of the synthetic speech
      • for both speakers, the speaking rate of synthetic speech is much faster than that of the actual data when we use ≤ 5 minutes of data (Table 1)
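
A rough numpy sketch of how an EER could be computed from speaker-verification similarity scores of positive (same-speaker) and negative (different-speaker) pairs; the paper's exact pairing protocol and speaker-verification model are not reproduced here.

```python
import numpy as np

def equal_error_rate(pos_scores, neg_scores):
    """EER: error rate at the threshold where the false-acceptance rate
    (negative pairs scored above threshold) equals the false-rejection rate
    (positive pairs scored below threshold)."""
    thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))
    far = np.array([(neg_scores >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(pos_scores < t).mean() for t in thresholds])   # false rejections
    idx = np.argmin(np.abs(far - frr))                              # closest crossing point
    return (far[idx] + frr[idx]) / 2

# Toy usage: cosine similarities of utterance-embedding pairs (made-up numbers).
rng = np.random.default_rng(0)
pos = rng.normal(0.7, 0.1, 200)  # same-speaker pairs
neg = rng.normal(0.3, 0.1, 200)  # different-speaker pairs
print("EER:", equal_error_rate(pos, neg))
```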
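
The pitch-error metrics can be sketched from two aligned F0 contours using their common definitions; the 20% gross-error tolerance below is the usual convention and is assumed here, not taken from the paper.

```python
import numpy as np

def pitch_error_metrics(f0_ref, f0_syn, tol=0.2):
    """GPE, VDE, and FFE from two aligned F0 contours (0 Hz = unvoiced frame)."""
    f0_ref, f0_syn = np.asarray(f0_ref, float), np.asarray(f0_syn, float)
    voiced_ref, voiced_syn = f0_ref > 0, f0_syn > 0
    n_frames = len(f0_ref)

    both_voiced = voiced_ref & voiced_syn
    gross = both_voiced & (np.abs(f0_syn - f0_ref) > tol * f0_ref)

    gpe = gross.sum() / max(both_voiced.sum(), 1)                 # gross pitch error
    vde = (voiced_ref != voiced_syn).sum() / n_frames             # voicing decision error
    ffe = ((voiced_ref != voiced_syn) | gross).sum() / n_frames   # F0 frame error
    return gpe, vde, ffe

# Toy usage with made-up contours (Hz); real contours would come from e.g. PYIN.
ref = np.array([0, 0, 110, 112, 115, 0, 220, 222])
syn = np.array([0, 105, 111, 150, 114, 0, 0, 221])
print(pitch_error_metrics(ref, syn))
```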

 

All images are from the reference paper: https://arxiv.org/abs/2110.05798
