Only the parts I was interested in are summarized.
https://arxiv.org/pdf/2303.03926
A conditional language modeling task over neural codec codes. An AR language model generates the audio codec codes of the first EnCodec quantizer from paired phoneme sequences, and an NAR model then generates the codes of the remaining quantizers in parallel. The multilingual autoregressive codec LM and the multilingual non-autoregressive codec LM generate the acoustic tokens at different levels of detail. acoustic quantizer, in VALL-E …
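A minimal sketch of the two-stage decoding described above, assuming the usual EnCodec setup (8 residual quantizers, 1024-entry codebooks); the model calls are random stand-ins for the trained AR/NAR codec LMs, and the phoneme prompt and sequence length are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_QUANTIZERS = 8   # EnCodec residual quantizers (assumed)
CODEBOOK = 1024      # codebook size per quantizer (assumed)
T = 50               # acoustic token length, fixed here for simplicity

def ar_first_quantizer(phonemes, t_steps):
    """Stand-in for the AR codec LM: emits first-quantizer tokens one at a
    time, each conditioned on the phoneme prompt and all earlier tokens."""
    tokens = []
    for _ in range(t_steps):
        # a real model samples from p(c_t | phonemes, c_<t); we sample uniformly
        tokens.append(rng.integers(CODEBOOK))
    return np.array(tokens)

def nar_remaining_quantizers(phonemes, first_layer):
    """Stand-in for the NAR codec LM: predicts quantizers 2..8 for all time
    steps in parallel, one pass per quantizer level (no loop over time)."""
    layers = [first_layer]
    for _ in range(1, NUM_QUANTIZERS):
        layers.append(rng.integers(CODEBOOK, size=first_layer.shape))
    return np.stack(layers)          # (NUM_QUANTIZERS, T)

phonemes = ["DH", "AH", "K", "AE", "T"]   # hypothetical phoneme prompt
codes = nar_remaining_quantizers(phonemes, ar_first_quantizer(phonemes, T))
print(codes.shape)   # (8, 50), fed to the EnCodec decoder to get a waveform
```

The AR stage is the only sequential loop; everything above the first quantizer is filled in with parallel passes, which is where the speedup over fully autoregressive codec generation comes from.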
https://arxiv.org/abs/1609.03499 WaveNet: A Generative Model for Raw Audio
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that …
v1 2016, v2 2016
Introduction
Joint probabilities are factorized as in pixel/wor…
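WaveNet models the joint probability of a waveform as a product of per-sample conditionals, p(x) = ∏ p(x_t | x_<t), and covers a long past context cheaply with stacks of dilated causal convolutions. A small sketch of how the receptive field grows with the dilation schedule (kernel size 2, doubling dilations, as in the paper):

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field of a stack of dilated causal convolutions:
    each layer adds dilation * (kernel_size - 1) past samples."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)

# one WaveNet-style block of doubling dilations 1, 2, 4, ..., 512
block = [2 ** i for i in range(10)]
print(receptive_field(block))        # 1024 samples
print(receptive_field(block * 3))    # 3070 samples when 3 blocks are stacked
```

This is why a handful of layers can condition each sample on thousands of previous ones, while a non-dilated causal stack of the same depth would see only a few dozen.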
https://arxiv.org/abs/2006.04558 FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duratio…
v1 2020, v8 2022
ICLR 2021
Introduction
Existing …
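The core non-autoregressive trick in FastSpeech/FastSpeech 2 is the length regulator: a duration predictor assigns each phoneme a frame count, and the phoneme hidden states are expanded to mel-spectrogram length so the decoder can run in parallel. A minimal sketch (the dimensions and durations are made up):

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Expand each phoneme's hidden vector by its predicted duration
    (in frames), so the output length matches the mel-spectrogram."""
    return np.repeat(phoneme_hidden, durations, axis=0)

h = np.arange(12, dtype=float).reshape(4, 3)   # 4 phonemes, hidden dim 3
d = np.array([2, 1, 3, 2])                     # hypothetical predicted durations
mel_input = length_regulate(h, d)
print(mel_input.shape)   # (8, 3): one row per output frame, sum(d) frames
```

Because the expansion is a single `np.repeat`-style operation rather than a step-by-step attention alignment, the whole mel sequence can be decoded in one parallel pass.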
https://arxiv.org/abs/2404.04645 HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks
Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While developing TTS architectures that train and test on the same set of speakers has seen significant improvements, out-of-domain spea…
Contribution
Dynamic Ad…
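The idea behind hypernetwork-based adaptation is that a small network maps a speaker embedding to the weights of an adapter, instead of storing a separate adapter per speaker. A rough sketch under assumed dimensions; the random hypernetwork weights stand in for trained ones, and the names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
SPK_DIM, HID, BOTTLENECK = 16, 64, 8   # assumed dimensions

# hypernetwork weights (trained jointly in practice; random here)
W_hyper = rng.normal(size=(SPK_DIM, 2 * HID * BOTTLENECK)) * 0.01

def generate_adapter(speaker_emb):
    """Hypernetwork: map a speaker embedding to the weights of a small
    bottleneck adapter, conditioning the adapter on the speaker."""
    flat = speaker_emb @ W_hyper
    w_down = flat[: HID * BOTTLENECK].reshape(HID, BOTTLENECK)
    w_up = flat[HID * BOTTLENECK :].reshape(BOTTLENECK, HID)
    return w_down, w_up

def adapter(x, w_down, w_up):
    # residual bottleneck adapter inserted into the TTS backbone
    return x + np.maximum(x @ w_down, 0) @ w_up

spk = rng.normal(size=SPK_DIM)          # speaker embedding
x = rng.normal(size=(10, HID))          # 10 frames of backbone hidden states
out = adapter(x, *generate_adapter(spk))
print(out.shape)   # (10, 64)
```

Only the hypernetwork's parameters are stored, so adding speakers does not grow the parameter count the way per-speaker adapters or full fine-tuning would.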
ICASSP 2022
https://arxiv.org/abs/2110.03857
Introduction
In research, the text content of training and test data are often highly similar and in the same text domain. For many real-world applications, TTS systems need to deal with text input with arbitrary content across a wide range of domains.
Collecting more data from specific target speakers is costly or impractical → use data from "non-target" speakers …
Interspeech 2022
https://arxiv.org/abs/2110.05798
Contribution
present transfer learning methods and guidelines for fine-tuning single-speaker TTS models for a new voice
evaluate and provide a detailed analysis with varying amounts of data
demonstrate that transfer learning can substantially reduce the training time and amount of data needed for synthesizing a new voice
open-source framework, provide a …
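A common way to realize this kind of transfer learning is to warm-start from the single-speaker checkpoint and fine-tune only selected parameter groups on the new voice's data. A small sketch; the module names and the choice of which groups to unfreeze are assumptions for illustration, not the paper's recipe:

```python
def select_trainable(param_names, finetune_prefixes):
    """Transfer-learning split: keep most pretrained weights frozen and
    fine-tune only the chosen parameter groups on the new voice's data."""
    return {name: any(name.startswith(p) for p in finetune_prefixes)
            for name in param_names}

# hypothetical parameter groups of a single-speaker TTS model
names = ["text_encoder.0", "text_encoder.1",
         "decoder.prenet", "decoder.attention", "postnet.conv"]
plan = select_trainable(names, finetune_prefixes=("decoder.", "postnet."))
print([n for n, train in plan.items() if train])
# ['decoder.prenet', 'decoder.attention', 'postnet.conv']
```

Freezing the text-side modules preserves what was learned from the large source-speaker corpus, which is what lets a new voice be learned from far less data and training time.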
https://arxiv.org/abs/2406.09569 Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle continuous audio wit…
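The RNN-T-flavored streaming control flow can be sketched as follows: after each incoming audio chunk, the decoder emits text tokens until it produces a blank, which means "no more text yet, feed me more audio". The `toy_step` function below is a deliberately naive stand-in for the real decoder, just to exercise the loop:

```python
def stream_decode(audio_chunks, step_fn, blank="<blank>"):
    """Streaming loop in the RNN-T spirit: per chunk, emit tokens until
    the model outputs blank, then wait for the next chunk of audio."""
    hypothesis = []
    for chunk in audio_chunks:
        while True:
            token = step_fn(chunk, hypothesis)
            if token == blank:
                break                 # model asks for more audio
            hypothesis.append(token)
    return hypothesis

def toy_step(chunk, hyp):
    """Toy decoder: spell out the chunk one symbol per call, blank when
    the chunk is exhausted (assumes symbols are unique across chunks)."""
    remaining = [c for c in chunk if c not in hyp]
    return remaining[0] if remaining else "<blank>"

print(stream_decode([["h", "i"], ["y", "o"]], toy_step))
# ['h', 'i', 'y', 'o']
```

The key difference from plain decoder-only ASR is that the blank token makes the passage of time an explicit part of the output vocabulary, so decoding can interleave with audio arrival instead of waiting for the full utterance.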
Abstract
Proposes a resolution-connected generator and resolution-wise discriminators. In addition, the discrete wavelet transform is used inside the discriminators to reproduce high-frequency components accurately. Fre-GAN trails ground-truth audio by only about 0.03 MOS.
1. Introduction
Autoregressive models show good performance but have slow inference speed. Flow-based vocoders were proposed to overcome this structural limitation. Although they generate natural waveforms in real time, [they] transform a noise sequence in parallel into a raw wavefor…
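The point of using the discrete wavelet transform in the discriminators is that, unlike average-pooling downsampling, it is lossless: the signal splits into low- and high-frequency subbands and can be reconstructed exactly, so high-frequency information is never thrown away. A minimal one-level Haar DWT sketch (Fre-GAN itself may use a different wavelet; this is just the simplest instance):

```python
import numpy as np

def haar_dwt(x):
    """One level of the (Haar) discrete wavelet transform: split the
    waveform into a low-frequency approximation and a high-frequency
    detail band, each half the original length."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)    # low-pass subband
    detail = (even - odd) / np.sqrt(2)    # high-pass subband
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse transform: the decomposition is lossless, unlike average
    pooling, which discards the detail band entirely."""
    x = np.empty(approx.size * 2)
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

wave = np.sin(np.linspace(0, 20, 16))
lo, hi = haar_dwt(wave)
print(np.allclose(haar_idwt(lo, hi), wave))   # True: perfect reconstruction
```

Feeding each discriminator a DWT-downsampled view gives it multi-resolution inputs while keeping the high-frequency content that pooling-based downsampling would blur away.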