HYPERTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

2024. 12. 10. 21:44
Table of Contents
  1. Contribution
  2. Related Works
  3. Text-to-speech models
  4. Speaker Adaptation in TTS
  5. Dynamic Parameters
  6. Method
  7. Encoder
  8. Variance Adapter
  9. Mel-Decoder and Postnet
  10. Hypernetwork
  11. Experiments
  12. Baseline models
  13. Datasets
  14. Evaluation Metrics
  15. Results and Discussions

https://arxiv.org/abs/2404.04645

 


Contribution

  1. Dynamic Adapters : learns speaker-adaptive adapters.
  2. Parameter Sampling : employs parameter sampling from a continuous distribution defined by a learnable hypernetwork.
  3. Parameter Efficiency : achieves competitive results with less than 1% of backbone parameters, making it highly practical and resource-friendly for scalable applications.
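The sub-1% claim is easy to sanity-check with back-of-envelope arithmetic. The backbone size, hidden size, bottleneck size, and adapter count below are hypothetical stand-ins for illustration, not the paper's numbers:

```python
# Illustrative check of the "<1% of backbone params" claim.
# All counts are hypothetical assumptions, not taken from the paper.
backbone_params = 35_000_000          # assumed FastSpeech2-style backbone size

# A bottleneck adapter holds a down-projection (d -> r) and an up-projection (r -> d).
d, r = 256, 16                        # assumed hidden size and bottleneck size
adapter_params = d * r + r * d        # weights of one adapter block (biases ignored)

num_adapters = 10                     # assumed number of inserted adapter blocks
total_adapter_params = num_adapters * adapter_params

fraction = total_adapter_params / backbone_params
print(f"{fraction:.4%}")              # well under 1% of the backbone parameters
```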

Related Works

Text-to-speech models

  • Autoregressive TTS models : effective but slow on long utterances (both training and inference speed)
  • Non-autoregressive TTS models : reduce latency and improve training efficiency, but rely on external aligners or pre-trained AR models for phoneme durations
  • In-context learning : VALL-E and SPEAR-TTS leverage audio codecs to learn discrete speech tokens and employ a vocoder-like decoder to convert these tokens into waveforms; VOICEBOX uses continuous features such as mel-spectrograms with HiFi-GAN.

Speaker Adaptation in TTS

  • aims to personalize the synthesized speech by modifying the voice characteristics to match those of a specific target speaker
  • aims to accommodate a wide range of linguistic variations introduced by the target domain (diverse accents, speakers, and low-resource scenarios) while keeping the number of trainable parameters low
  • HYPER-TTS focuses on parameter-efficient domain adaptation of the backbone TTS model to a target set of speakers

Dynamic Parameters

  • specific to adapters, prior work made prompt tokens dynamic by conditioning their values on the input text using a parameter prompt generator network, and used hypernetworks to generate adapter down- and up-projection weights
  • HYPER-TTS is the first work that studies the utility of a parameter generator in the domain of speech

Method

Encoder

  • phoneme sequence —map—> vector embedding mixed with sinusoidal positional encoding
  • Figure 2-(d) : 4 feed-forward Transformer (FFT) blocks (adopting the FFT of FastSpeech’s encoder), with each block comprising two multi-head attention modules and two 1D convolutions over the phoneme sequence to capture local phoneme information (adjacent phonemes and mel-spectrogram features are more closely related in speech)
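The sinusoidal positional encoding mixed into the phoneme embeddings follows the standard Transformer formulation. A minimal sketch, where the sequence length and hidden size are assumptions:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard Transformer sinusoidal encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                       # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Phoneme embeddings (random stand-ins) are mixed with the encoding by addition.
phoneme_emb = np.random.randn(12, 256)    # assumed: 12 phonemes, hidden size 256
encoder_input = phoneme_emb + sinusoidal_positional_encoding(12, 256)
```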

Variance Adapter

  • phoneme embeddings (length n) —transform—> mel-spectrogram embeddings (length m); m is typically larger than n
  • Duration Predictor : solves the length mismatch (each phoneme tends to map to one or more mel frames). Takes the phonemes and predicts a duration, a positive number indicating how many mel frames each phoneme maps to; the phoneme sequence is expanded accordingly before entering the other adapters
  • Pitch Predictor : uses the continuous wavelet transform (CWT); during inference the output is converted back to pitch contours using the inverse CWT (iCWT). Trained by minimizing the MSE between the spectrogram, mean, and variance values of the ground truth and the predictions
  • Energy Predictor : estimates the original energy value for each STFT frame by computing the L2-norm of the frame’s amplitude. Energies are quantized into 256 evenly distributed values, encoded as an energy embedding, and added to the expanded hidden sequence
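Two of the steps above, duration-based length regulation and 256-bin energy quantization, can be sketched as follows; the array sizes and energy values are toy assumptions:

```python
import numpy as np

def length_regulate(phoneme_hidden: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each phoneme's hidden vector `duration` times (FastSpeech-style)."""
    return np.repeat(phoneme_hidden, durations, axis=0)

# 3 phonemes with hidden size 4; durations say how many mel frames each covers.
hidden = np.arange(12, dtype=float).reshape(3, 4)
durations = np.array([2, 1, 3])                  # predicted positive integers
mel_hidden = length_regulate(hidden, durations)  # shape (6, 4): n=3 -> m=6

# Energy quantization sketch: 256 evenly spaced bins over the observed range.
energy = np.array([0.1, 0.5, 0.9, 2.0])          # toy per-frame L2-norm energies
bins = np.linspace(energy.min(), energy.max(), 256)
energy_ids = np.digitize(energy, bins) - 1       # indices into an embedding table
```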

Mel-Decoder and Postnet

  • the variance adaptor’s hidden sequence —convert—> mel-spectrogram
  • same architecture as the encoder but with 6 FFT blocks
  • to improve mel-spectrogram quality (reducing artifacts and distortions in speech), a Postnet is applied to the mel-decoder’s output

Hypernetwork

  • typically a small neural network that generates weights for a larger main network performing the usual learning task.
  • by learning to adapt their parameters as the speaker changes, the hypernetwork enhances the effectiveness of the adapters
  • d1-dimensional speaker embedding —speaker projector—> d2-dimensional space
  • concatenate a dl-dimensional layer embedding (a learnable look-up table that maps a layer-id to a vector); the resulting (d2+dl)-dimensional vector —source projector network—> ds-dimensional space
  • for the adapter down/up projections, weights are sampled from the hypernetwork through the source projector network using dedicated dense layers (the Parameter Sampler)
  • Utilizing a hypernetwork to customize adapter block weights for the TTS backbone significantly expands the adapter parameter space and enables input-conditioned parameter sampling. Additionally, the continuous parameter space theoretically allows generating adapter parameters for numerous speakers without increasing hypernetwork parameters.
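The speaker-projector, layer-embedding, source-projector, and parameter-sampler pipeline above can be sketched with plain NumPy. All dimensions (d1, d2, dl, ds, d, r) and the dense-layer shapes are illustrative assumptions, and the learnable weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, dl, ds = 192, 64, 16, 64     # assumed embedding/projection dimensions
d, r = 256, 16                       # assumed backbone hidden size, bottleneck size

# Learnable parameters of the hypernetwork (random stand-ins here).
W_speaker = rng.normal(size=(d1, d2))            # speaker projector
layer_table = rng.normal(size=(10, dl))          # layer-id -> layer embedding
W_source = rng.normal(size=(d2 + dl, ds))        # source projector
W_down_sampler = rng.normal(size=(ds, d * r))    # parameter sampler (down-proj)
W_up_sampler = rng.normal(size=(ds, r * d))      # parameter sampler (up-proj)

def generate_adapter_weights(speaker_emb, layer_id):
    """Sample adapter down/up projection weights conditioned on speaker and layer."""
    s = speaker_emb @ W_speaker                                     # d1 -> d2
    source = np.concatenate([s, layer_table[layer_id]]) @ W_source  # (d2+dl) -> ds
    W_down = (source @ W_down_sampler).reshape(d, r)
    W_up = (source @ W_up_sampler).reshape(r, d)
    return W_down, W_up

W_down, W_up = generate_adapter_weights(rng.normal(size=d1), layer_id=3)
```

Because the adapter weights are a function of the speaker embedding, a new speaker only requires a new embedding, not new hypernetwork parameters.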

Experiments

Baseline models

  • TTS-0 : zero-shot performance of the TTS model, pre-trained on LibriTTS and evaluated on target data without any fine-tuning (baseline lower bound)
  • Reference and Reference (Voc.) : ground-truth speech; (Voc.) transforms the reference speech into mel-spectrograms and reconstructs the speech using HiFi-GAN
  • TTS-FT (full fine-tuning) : obtained after fine-tuning all parameters of the backbone model on the target dataset (baseline upper bound)
  • AdapterTTS : inserts bottleneck adapter modules (a down-/up-projection layer) and learns only the adapter parameters, keeping the backbone parameters frozen. $AdapterTTS_e, AdapterTTS_v, AdapterTTS_d, AdapterTTS_{e/v/d}$ : bottleneck adapter block inserted in the encoder, VA, decoder, and all of them combined.
  • HyperTTS : bottleneck adapter block inserted in each module as in the AdapterTTS variants: $HyperTTS_e, HyperTTS_v, HyperTTS_d, HyperTTS_{e/v/d}$. Number of parameters : $HyperTTS_{e/v/d} > HyperTTS_e \approx HyperTTS_v \approx HyperTTS_d$
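The bottleneck adapter block shared by the AdapterTTS and HyperTTS variants can be sketched as below. The hidden size, bottleneck size, ReLU nonlinearity, and zero-initialized up-projection are assumptions for illustration, not details confirmed by the paper:

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up   # ReLU assumed as nonlinearity

rng = np.random.default_rng(1)
d, r = 256, 16                        # assumed hidden and bottleneck sizes
h = rng.normal(size=(6, d))           # frozen backbone hidden states (toy)
W_down = rng.normal(size=(d, r)) * 0.01
W_up = np.zeros((r, d))               # zero-init up-projection: identity at start
out = adapter_forward(h, W_down, W_up)
```

With a zero-initialized up-projection the block starts as an identity mapping, so inserting it does not perturb the frozen backbone before adaptation training.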

Datasets

  • LTS : LibriTTS train-clean-100 subset ← used to pre-train the TTS backbone
  • LTS2 : LibriTTS dev-clean and test-clean ← used for adaptation
  • VCTK ← used for adaptation
  • each adaptation dataset is divided into train and validation subsets

Evaluation Metrics

  • Objective : cosine similarity (COS), F0 Frame Error (FFE), Mel-cepstral distortion (MCD), and Word Error Rate (WER)
  • Subjective : MOS, XAB test
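Of the objective metrics, cosine similarity (COS) between speaker embeddings of synthesized and reference speech is the simplest to sketch; the toy vectors below stand in for real speaker-encoder outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

synth_emb = np.array([1.0, 0.0, 1.0])   # toy embedding of synthesized speech
ref_emb = np.array([1.0, 0.0, 1.0])     # toy embedding of reference speech
sim = cosine_similarity(synth_emb, ref_emb)   # close to 1.0 for identical vectors
```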

Results and Discussions

  • HyperTTS approaches the performance of TTS-FT; AdapterTTS performs poorly, and also gives poor results in the multi-speaker TTS setting.
  • MOS : TTS-FT > HYPERTTS_d (which used only 0.422% of the parameters of TTS-FT) > AdapterTTS
  • XAB test : HYPERTTS_d > AdapterTTS
  • in the continuous parameter space, different reference speech samples from the same speaker cluster together while different speakers are distant from each other.
