https://arxiv.org/abs/2404.04645
Contribution
- Dynamic Adapters : learns speaker-adaptive adapters.
- Parameter Sampling : samples adapter parameters from a continuous distribution defined by a learnable hypernetwork.
- Parameter Efficiency : achieves competitive results with less than 1% of the backbone parameters, making it practical and resource-friendly for scalable applications.
Related Works
Text-to-speech models
- Autoregressive TTS models : effective but slow on long utterances (training and inference speed)
- Non-Autoregressive TTS models : reduce latency and improve training efficiency, but rely on external aligners or pre-trained AR models for phoneme durations
- In-context learning : VALL-E and SPEAR-TTS leverage neural codecs to learn discrete speech tokens and employ a vocoder-like decoder to convert these tokens into waveforms. VOICEBOX uses continuous features such as mel-spectrograms with HiFi-GAN.
Speaker Adaptation in TTS
- aims to personalize the synthesized speech by modifying the voice characteristics to match those of a specific target speaker
- aims to accommodate a wide range of linguistic variations introduced by the target domain, including diverse accents, speakers, and low-resource scenarios, while keeping the number of trainable parameters low
- HYPER-TTS focuses on parameter-efficient domain adaptation of the backbone TTS model to a target set of speakers
Dynamic Parameters
- specific to adapters : prior work makes prompt tokens dynamic by conditioning their values on the input text via a parameter prompt generator network, and uses hypernetworks to generate adapter down- and up-projection weights.
- HYPER-TTS is the first work to study the utility of a parameter generator in the speech domain
Method
Encoder
- phoneme sequence —map—> vector embedding mixed with sinusoidal positional encoding
- Figure 2-(d) : 4 feed-forward transformer (FFT) blocks (← adopts FastSpeech's encoder FFT), each block comprising two multi-head attention modules and two 1D convolutions (← over the phoneme sequence to capture local phoneme info; adjacent phonemes and mel-spectrogram features are more closely related in speech)
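The positional encoding mixed into the phoneme embeddings is the standard Transformer sinusoid. A minimal NumPy sketch (the dimensions here are illustrative, not the paper's):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard Transformer sinusoidal encoding, added to phoneme embeddings."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=5, d_model=8)
print(pe.shape)  # (5, 8)
```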
Variance Adapter
- phoneme embeddings(length n) —transform—> mel-spectrogram embeddings (length m). m is typically larger than n
- Duration Predictor : solves the length mismatch (each phoneme tends to map to one or more mel frames). It takes the phonemes and predicts each phoneme's duration, a positive number indicating how many mel frames the phoneme matches; the phoneme sequence is expanded accordingly before entering the other adapters.
- Pitch Predictor : uses a continuous wavelet transform (CWT); during inference, the output is converted back to pitch contours using the inverse CWT (iCWT). Trained by minimizing the MSE between the spectrogram, mean, and variance values of the ground truth and the predictions.
- Energy Predictor : estimates the original energy value of each STFT frame by computing the L2-norm of the frame's amplitude. Energies are quantized into 256 evenly distributed values, encoded as energy embeddings, and added to the expanded hidden sequence.
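The expansion step (FastSpeech's length regulator) simply repeats each phoneme's hidden vector by its predicted duration. A minimal NumPy sketch with toy values:

```python
import numpy as np

def length_regulate(phoneme_hidden: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Expand each phoneme hidden vector by its predicted duration (# mel frames)."""
    # np.repeat duplicates row i of phoneme_hidden exactly durations[i] times.
    return np.repeat(phoneme_hidden, durations, axis=0)

hidden = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # n = 3 phonemes
durations = np.array([2, 1, 3])                          # predicted mel frames per phoneme
mel_hidden = length_regulate(hidden, durations)
print(mel_hidden.shape)  # (6, 2): m = 2 + 1 + 3 frames
```

This is why m (mel length) is typically larger than n (phoneme length): durations are positive integers summing to m.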
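The energy quantize-and-embed step can be sketched as follows; the bin range, embedding size, and random table are stand-ins for the learned embedding table:

```python
import numpy as np

def energy_embedding(energy: np.ndarray, table: np.ndarray,
                     e_min: float, e_max: float, n_bins: int = 256) -> np.ndarray:
    """Quantize per-frame energy into n_bins evenly spaced buckets, then look up embeddings."""
    bins = np.linspace(e_min, e_max, n_bins - 1)   # 255 boundaries -> 256 buckets
    idx = np.digitize(energy, bins)                # bucket index per STFT frame
    return table[idx]                              # (frames, d) energy embeddings

rng = np.random.default_rng(0)
table = rng.standard_normal((256, 4))              # learnable table (random here, illustrative d=4)
energy = np.array([0.1, 0.5, 0.9])                 # L2-norm of each frame's amplitude
emb = energy_embedding(energy, table, e_min=0.0, e_max=1.0)
print(emb.shape)  # (3, 4)
```

The resulting embeddings are what gets added to the expanded hidden sequence.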
Mel-Decoder and Postnet
- the variance adaptor’s hidden sequence —convert—> mel-spectrogram
- same architecture as the encoder but with 6 FFT blocks
- to improve mel-spectrogram quality (reducing artifacts and distortions in speech), a Postnet is applied to the mel-decoder's output
Hypernetwork
- typically a small neural network that generates weights for a larger main network performing the usual learning task.
- by learning to adapt the adapters' parameters as the speaker changes, the hypernetwork enhances the effectiveness of the adapters
- d1-dimensional speaker embedding — speaker projector —> d2-dimensional space
- concatenate a dl-dimensional layer embedding (a learnable look-up table that maps a layer id to a vector); the resulting (d2+dl)-dimensional vector — source projector network —> ds-dimensional space
- for the adapter down/up projections, sample weights from the hypernetwork's source-projector output using dedicated dense (Parameter Sampler) layers
- Utilizing a hypernetwork to customize adapter block weights for the TTS backbone significantly expands the adapter parameter space and enables input-conditioned parameter sampling. Additionally, the continuous parameter space theoretically allows generating adapter parameters for numerous speakers without increasing the hypernetwork's parameter count.
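The pipeline above (speaker projector → concat layer embedding → source projector → dense parameter samplers) can be sketched in NumPy. All dimensions and the random weights are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, dl, ds = 64, 32, 8, 16      # speaker / projected / layer / source dims (assumed)
d_model, bottleneck = 128, 4        # backbone hidden size and adapter bottleneck (assumed)

# Hypernetwork weights (all learnable in practice; random here for the sketch).
W_spk = rng.standard_normal((d1, d2)) * 0.01           # speaker projector
layer_table = rng.standard_normal((6, dl)) * 0.01      # layer-id look-up table
W_src = rng.standard_normal((d2 + dl, ds)) * 0.01      # source projector
W_down_sampler = rng.standard_normal((ds, d_model * bottleneck)) * 0.01
W_up_sampler = rng.standard_normal((ds, bottleneck * d_model)) * 0.01

def sample_adapter_weights(speaker_emb: np.ndarray, layer_id: int):
    """Generate one adapter's down/up-projection weights, conditioned on speaker + layer."""
    s = speaker_emb @ W_spk                              # d1 -> d2
    src = np.concatenate([s, layer_table[layer_id]])     # (d2 + dl,)
    z = src @ W_src                                      # -> ds
    W_down = (z @ W_down_sampler).reshape(d_model, bottleneck)
    W_up = (z @ W_up_sampler).reshape(bottleneck, d_model)
    return W_down, W_up

W_down, W_up = sample_adapter_weights(rng.standard_normal(d1), layer_id=2)
print(W_down.shape, W_up.shape)  # (128, 4) (4, 128)
```

Note that the hypernetwork's size is independent of the number of speakers: a new speaker only means a new embedding fed through the same sampler.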
Experiments
Baseline models
- TTS-0 : zero-shot performance of the TTS model, pre-trained on LibriTTS and evaluated on the target data without any fine-tuning (baseline lower bound)
- Reference and Reference (Voc.) : ground-truth speech; (Voc.) transforms the reference speech into mel-spectrograms and reconstructs the speech with HiFi-GAN
- TTS-FT (full fine-tuning) : obtained after fine-tuning all parameters of the backbone model on the target dataset (baseline upper bound)
- AdapterTTS : inserts bottleneck adapter modules (a down-/up-projection layer pair) and learns only the adapter parameters, keeping the backbone parameters frozen. $AdapterTTS_e, AdapterTTS_v, AdapterTTS_d, AdapterTTS_{e/v/d}$ : bottleneck adapter block inserted in the encoder, VA, decoder, and all three combined.
- HyperTTS : bottleneck adapter block inserted in each module like the AdapterTTS variants. $HyperTTS_e, HyperTTS_v, HyperTTS_d, HyperTTS_{e/v/d}$; number of parameters : $HyperTTS_{e/v/d} > HyperTTS_e \approx HyperTTS_v \approx HyperTTS_d$
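The bottleneck adapter block shared by AdapterTTS and HyperTTS is a down-projection, nonlinearity, up-projection with a residual connection. A minimal NumPy sketch (ReLU and the dimensions are assumptions, not necessarily the paper's choices):

```python
import numpy as np

def adapter_forward(h: np.ndarray, W_down: np.ndarray, W_up: np.ndarray) -> np.ndarray:
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up   # ReLU assumed as the nonlinearity

rng = np.random.default_rng(0)
h = rng.standard_normal((10, 128))                  # hidden states from a frozen backbone layer
W_down = rng.standard_normal((128, 4)) * 0.01       # only these small matrices are trained
W_up = rng.standard_normal((4, 128)) * 0.01         # (AdapterTTS) or sampled (HyperTTS)
out = adapter_forward(h, W_down, W_up)
print(out.shape)  # (10, 128)
```

The only difference between the two variants is where `W_down`/`W_up` come from: learned directly (AdapterTTS) or sampled per speaker and layer by the hypernetwork (HyperTTS).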
Datasets
- LTS : train-clean-100 subset ← TTS backbone pre-trained.
- LTS2 : dev-clean, test-clean ← adaptation
- VCTK ← adaptation
- divided into train and validation subsets
Evaluation Metrics
- Objective : cosine similarity(COS) and F0 Frame Error (FFE), Mel cepstral distortion (MCD), Word Error Rate (WER)
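F0 Frame Error is commonly defined as the fraction of frames with either a voicing decision error or a pitch deviation above 20%. A minimal NumPy sketch of that definition (the 20% tolerance and f0 = 0 marking unvoiced frames are the usual conventions, assumed here):

```python
import numpy as np

def f0_frame_error(f0_ref: np.ndarray, f0_syn: np.ndarray, tol: float = 0.2) -> float:
    """FFE: fraction of frames with a voicing error or a >20% pitch deviation."""
    voiced_ref, voiced_syn = f0_ref > 0, f0_syn > 0
    voicing_err = voiced_ref != voiced_syn                    # voiced/unvoiced mismatch
    pitch_err = voiced_ref & voiced_syn & (
        np.abs(f0_syn - f0_ref) > tol * f0_ref)               # gross pitch error
    return float(np.mean(voicing_err | pitch_err))

f0_ref = np.array([100.0, 200.0, 0.0, 150.0])
f0_syn = np.array([105.0, 300.0, 50.0, 150.0])
print(f0_frame_error(f0_ref, f0_syn))  # 0.5: second frame (pitch) and third (voicing) are errors
```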
- Subjective : MOS, XAB test
Results and Discussions
- HyperTTS approaches TTS-FT performance; AdapterTTS performs poorly, and also gives poor results on multi-speaker TTS.
- MOS : TTS-FT > $HyperTTS_d$ (which uses only 0.422% of the parameters of TTS-FT) > AdapterTTS
- XAB test : $HyperTTS_d$ > AdapterTTS
- in the continuous parameter space, different reference speech samples from the same speaker cluster together while different speakers are far apart.