Adapting TTS models For New Speakers using Transfer Learning

2024. 12. 4. 20:50 · Study/Papers
Table of Contents
  1. Contribution
  2. Background and Related Work
  3. Methodology
  4. Experiments

Interspeech 2022

https://arxiv.org/abs/2110.05798

Contribution

  • present transfer learning methods and guidelines for finetuning single-speaker TTS models for a new voice
  • evaluate and provide a detailed analysis with varying amounts of data
  • demonstrate that transfer learning can substantially reduce the training time and amount of data needed for synthesizing a new voice
  • release an open-source framework and provide a demo

Background and Related Work

  • decompose the waveform synthesis pipeline into two steps:
    1. synthesizing mel spectrograms from language
    2. vocoding the synthesized spectrogram to audible waveforms
    • multi-speaker models perform spectrogram synthesis additionally conditioned on a speaker embedding
  • To synthesize the voice of a target speaker, past works have investigated techniques like voice conversion and voice cloning
    • voice conversion
      • modify an utterance from a source speaker to make it sound like a target speaker
      • traditionally achieved via dynamic frequency warping that aligns the spectra of different speakers
      • more recently, spectral conversion uses encoder-decoder neural networks, typically trained on speech pairs of the target and source speakers
    • voice cloning
      • to synthesize speech for a new speaker (with limited speech data) for any unseen text
      • recent works on voice cloning have leveraged transfer learning
      • a multi-speaker TTS model is conditioned on a speaker encoder; the speaker encoder is trained independently for speaker verification on a large, speaker-diverse dataset and then plugged into the spectrogram synthesizer as a frozen or trainable component
      • the spectrogram synthesizer can be conditioned directly on speech samples of the new speaker to perform zero-shot voice cloning, or finetuned on text and speech pairs of the new speaker
    • limitations of the above techniques
      • require a large speaker-diverse dataset during training
      • such large-scale multi-speaker datasets are usually noisy, which makes the resulting audio unsuitable for user-facing applications

Methodology

  • Spectrogram synthesizer
    • use FastPitch and add a learnable alignment module, so ground-truth durations are not required
    • FastPitch
      • composed of two feed-forward Transformer (FFTr) stacks
        1. input tokens of phonemes: $x = (x_1, \ldots, x_n)$
        2. first FFTr outputs a hidden representation $h = FFTr(x)$, used to predict the duration and average pitch of every token
        3. $\hat{d} = DurationPredictor(h), \space\space\space \hat{p} = PitchPredictor(h)$
        4. pitch is projected to match the dimensionality of $h \in R^{n\times d}$ and added to $h$: $g = h + PitchEmbedding(p)$
        5. the resulting sum $g$ is discretely upsampled and passed to the second FFTr, which produces the output mel-spectrogram sequence $\hat{y} = FFTr([g_1, ..., g_1, ..., g_n, ..., g_n])$
      • duration prediction module : use learnable alignment-module and loss ($L_{align}$)
      • pitch-prediction module : the ground-truth $p$ is derived using PYIN and averaged over the input tokens using the durations $d$
      • the training loss is the mean-squared error between the predicted and ground-truth modalities plus the forward-sum alignment loss $L_{align}$ : $L = ||\hat{y}-y||^2_2 + \alpha||\hat{p}-p||^2_2 + \beta||\hat{d}-d||^2_2 + \gamma L_{align}$
      • during training : trained end-to-end on text and speech pairs; $y, p$ are computed in the data-loading pipeline
      • during inference : use the predicted $\hat{p}, \hat{d}$ to synthesize speech directly from text (a minimal sketch of this forward pass and loss follows this list)
  • Vocoder
    • uses the HiFi-GAN architecture; the HiFi-GAN paper reports that it can perform mel-spectrogram inversion for unseen speakers, but the vocoder is still finetuned here for better audio quality
    • upsamples mel-spectrograms to audio using transposed convolutions
    • two discriminator networks
      • the multi-period discriminator consists of small sub-discriminators, each of which looks only at specific periodic parts of the raw waveform (see the period-reshaping sketch after this list)
      • the multi-scale discriminator consists of small sub-discriminators that judge audio at different scales and learn to capture consecutive patterns and long-term dependencies of the waveform
  • Finetuning Methods
    • Direct Finetuning
      • finetune all the parameters of the pre-trained TTS models directly on the data of the new speaker
      • spectrogram-synthesis model ← text and speech pairs of the new speaker
      • vocoder ← only require the speech examples of the speaker
      • use mini-batch gradient descent with the Adam optimizer and a fixed learning rate
    • Mixed Finetuning
      • direct finetuning can result in overfitting or catastrophic forgetting when the amount of training data for the new speaker is very limited → mix the original speaker’s data with the new speaker’s data during finetuning
      • assume that we have enough training samples of the original speaker while the number of samples of the new speaker is limited
      • create a data-loading pipeline that samples an equal number of examples from the original and new speakers in each mini-batch
      • FastPitch has no speaker embedding layer, so one is added (since there are now two speakers); during training, look up the speaker embedding from the speaker id of each training sample and add it to the text embedding at each time step before feeding the input to the first FFTr of FastPitch: $h = FFTr(x + Repeat(speakerEmb))$
      • FastPitch model parameters ← from the pre-trained model, but the speaker embedding layer ← randomly initialized and trained along with the other parameters of the model
      • the vocoder needs no separate speaker conditioning because the spectrogram representation already contains speaker-specific attributes; it is simply finetuned on both speakers
      • the vocoder also uses mini-batches with balanced data from the two speakers during finetuning (a sketch of the speaker-embedding lookup and balanced sampling follows this list)
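
A minimal PyTorch sketch of the FastPitch-style forward pass and loss summarized above. This is an illustrative simplification, not the authors' or NVIDIA's implementation: the `TokenPredictor` module, layer sizes, loss weights, and the batch-size-1 upsampling are assumptions, and the learnable alignment module (which supplies $d$ and $L_{align}$) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenPredictor(nn.Module):
    """Stand-in for the duration / pitch predictors: one scalar per input token."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, 1))

    def forward(self, h):                       # h: (B, N, d)
        return self.net(h).squeeze(-1)          # (B, N)

class FastPitchSketch(nn.Module):
    def __init__(self, n_symbols=100, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True)
        self.fftr1 = nn.TransformerEncoder(layer, num_layers=2)   # first FFTr stack
        self.fftr2 = nn.TransformerEncoder(layer, num_layers=2)   # second FFTr stack
        self.duration_predictor = TokenPredictor(d_model)
        self.pitch_predictor = TokenPredictor(d_model)
        self.pitch_embedding = nn.Linear(1, d_model)  # project pitch to model dim
        self.mel_proj = nn.Linear(d_model, n_mels)

    def forward(self, tokens, pitch=None, durations=None):
        h = self.fftr1(self.embed(tokens))            # h = FFTr(x), (B, N, d)
        d_hat = self.duration_predictor(h)            # predicted durations (B, N)
        p_hat = self.pitch_predictor(h)               # predicted avg. pitch (B, N)
        # training: condition on ground-truth pitch/durations; inference: use predictions
        p = pitch if pitch is not None else p_hat
        d = durations if durations is not None else d_hat.round().clamp(min=1)
        g = h + self.pitch_embedding(p.unsqueeze(-1)) # g = h + PitchEmbedding(p)
        # discrete upsampling: repeat g_i d_i times (sketch assumes batch size 1)
        g_up = torch.repeat_interleave(g, d[0].long(), dim=1)
        y_hat = self.mel_proj(self.fftr2(g_up))       # output mel-spectrogram (B, T, n_mels)
        return y_hat, p_hat, d_hat

def fastpitch_loss(y_hat, y, p_hat, p, d_hat, d, l_align,
                   alpha=0.1, beta=0.1, gamma=0.1):   # weights are placeholders
    """L = ||y_hat - y||^2 + a*||p_hat - p||^2 + b*||d_hat - d||^2 + g*L_align."""
    return (F.mse_loss(y_hat, y) + alpha * F.mse_loss(p_hat, p)
            + beta * F.mse_loss(d_hat, d.float()) + gamma * l_align)
```

During training, the ground-truth $y$ and $p$ come from the data-loading pipeline and $d$, $L_{align}$ from the alignment module; at inference only the tokens are passed and the predicted $\hat{p}, \hat{d}$ are used.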
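
To make the "specific periodic parts of the raw waveform" idea concrete, here is a hedged sketch of a multi-period-style sub-discriminator: the 1-D waveform is reshaped into a 2-D (time/period, period) grid and scored with 2-D convolutions. Channel counts and kernel sizes are placeholders, not the official HiFi-GAN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodSubDiscriminator(nn.Module):
    """Scores samples that are `period` steps apart by folding the waveform into 2-D."""
    def __init__(self, period: int):
        super().__init__()
        self.period = period
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(64, 1, kernel_size=(3, 1), padding=(1, 0)),
        )

    def forward(self, wav):                      # wav: (B, 1, T)
        b, c, t = wav.shape
        if t % self.period != 0:                 # pad so T is a multiple of the period
            pad = self.period - t % self.period
            wav = F.pad(wav, (0, pad), mode="reflect")
            t = t + pad
        x = wav.view(b, c, t // self.period, self.period)  # (B, 1, T/p, p)
        return self.convs(x)                     # per-patch real/fake scores

# one sub-discriminator per period; HiFi-GAN uses the primes 2, 3, 5, 7, 11
mpd = nn.ModuleList(PeriodSubDiscriminator(p) for p in (2, 3, 5, 7, 11))
```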
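
The mixed-finetuning ingredients can also be sketched: (1) a speaker embedding looked up from the speaker id and added to the token embeddings at every time step before the first FFTr, and (2) a loader that draws an equal number of samples from the original and the new speaker in each mini-batch. The names here (`original_ds`, `new_ds`, `token_emb`) are illustrative assumptions, not the released framework's API.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

d_model, n_speakers = 256, 2
speaker_emb = nn.Embedding(n_speakers, d_model)   # randomly initialized, trained with the model

def first_fftr_input(token_emb: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
    """x + Repeat(speakerEmb): broadcast the (B, d) speaker vector over the N tokens."""
    spk = speaker_emb(speaker_id)                 # (B, d)
    return token_emb + spk.unsqueeze(1)           # (B, N, d)

def balanced_batches(original_ds, new_ds, batch_size=16):
    """Yield mini-batches with an equal number of samples from each speaker.
    The small new-speaker set is cycled so it can be paired with the ~5 h
    of original-speaker data."""
    half = batch_size // 2
    orig_loader = DataLoader(original_ds, batch_size=half, shuffle=True)
    new_loader = DataLoader(new_ds, batch_size=half, shuffle=True)
    new_iter = iter(new_loader)
    for orig_batch in orig_loader:
        try:
            new_batch = next(new_iter)
        except StopIteration:                     # restart the small new-speaker dataset
            new_iter = iter(new_loader)
            new_batch = next(new_iter)
        yield orig_batch, new_batch               # concatenate downstream before the forward pass
```

For the vocoder, only the balanced sampling is needed, since no speaker embedding is added there.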

Experiments

  • Dataset
    • Hi-Fi TTS dataset, 292 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz
    • keep aside 50 text and speech pairs from each speaker as validation samples
    • train a single-speaker TTS model on a female speaker and run finetuning experiments on one male and one female speaker
    • 4 training subsets for each finetuning speaker : 1 min, 5 min, 30 min, 60 min
    • for mixed finetuning, mix the new speaker’s data with 5000 samples (~5 hours) from the pretraining speaker
  • Metrics
    • naturalness
      • MOS
      • mixed finetuning > direct finetuning
    • voice similarity to the target speaker
      • visualize the speaker embeddings of real and synthesized data by reducing the 256-dimensional utterance embeddings to 2 dimensions using t-SNE (figure 1)
      • EER (Equal Error Rate) on synthetic data
      • create positive and negative pairs
      • the EER on synthetic data is close to that on real data, consistent with the observation from the t-SNE plots
      • the synthesized voices closely mimic the timbre of the target speaker (a sketch of computing EER from embedding similarities follows this list)
    • speaking style similarity to the target speaker
      • error metrics for the pitch (fundamental frequency) contours
      • GPE (Gross Pitch Error)
      • VDE (Voicing Decision Error)
      • FFE (F0 Frame Error)
      • increasing the amount of training data reduces the difference between the speaking rate (phonemes per second, derived from the sample text without using forced alignment) of actual data and that of synthetic speech
      • for both speakers, the speaking rate of synthetic speech is much faster than that of the actual data when ≤ 5 minutes of data are used (Table 1); a sketch of the GPE/VDE/FFE computations follows this list
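
A sketch of the EER evaluation: cosine similarities are computed for positive pairs (two utterances of the same target speaker, e.g. real vs. synthetic) and negative pairs (different speakers), and the EER is the point where false rejections and false acceptances balance. The speaker-encoder producing the 256-dim utterance embeddings is assumed to exist; any pretrained verification model works.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two utterance embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(pos_scores: np.ndarray, neg_scores: np.ndarray) -> float:
    """Sweep a decision threshold over all scores and return the point where the
    false rejection rate (positives below threshold) equals the false acceptance
    rate (negatives at or above threshold)."""
    thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))
    frr = np.array([(pos_scores < t).mean() for t in thresholds])
    far = np.array([(neg_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2)

# usage sketch (positive_pairs / negative_pairs are hypothetical lists of embedding pairs):
# pos = np.array([cosine(e1, e2) for e1, e2 in positive_pairs])
# neg = np.array([cosine(e1, e2) for e1, e2 in negative_pairs])
# print(equal_error_rate(pos, neg))
```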
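
The pitch-contour error metrics can be sketched as follows. The F0 contours and voicing flags are assumed to come from a pitch tracker such as pYIN, and the 20% tolerance for GPE is the commonly used value, an assumption rather than something specified in the notes above.

```python
import numpy as np

def pitch_metrics(f0_ref, f0_syn, voiced_ref, voiced_syn, tol=0.2):
    """GPE, VDE, FFE between a reference and a synthesized F0 contour of equal length."""
    f0_ref, f0_syn = np.asarray(f0_ref, float), np.asarray(f0_syn, float)
    voiced_ref, voiced_syn = np.asarray(voiced_ref, bool), np.asarray(voiced_syn, bool)
    n = len(f0_ref)

    both_voiced = voiced_ref & voiced_syn
    gross = both_voiced & (np.abs(f0_syn - f0_ref) > tol * f0_ref)

    gpe = gross.sum() / max(both_voiced.sum(), 1)          # Gross Pitch Error
    vde = (voiced_ref != voiced_syn).sum() / n             # Voicing Decision Error
    ffe = ((voiced_ref != voiced_syn) | gross).sum() / n   # F0 Frame Error
    return gpe, vde, ffe
```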

 

All images are from the reference paper: https://arxiv.org/abs/2110.05798
