Generative Models for Speech Enhancement: topic of the dissertation and author's abstract (VAK RF specialty 00.00.00), Candidate of Sciences Pavel Konstantinovich Andreev
- VAK RF specialty: 00.00.00
- Number of pages: 185
Table of contents of the dissertation, Candidate of Sciences Pavel Konstantinovich Andreev
Contents
Introduction
Chapter 1 Background
1.1 Degradation Models
1.2 Metrics for Speech Enhancement
1.2.1 Subjective Metrics
1.2.2 Objective Metrics
1.3 MOS Prediction
1.4 Literature Review
1.4.1 Regression-based Approaches
1.4.2 GAN-based Approaches
1.4.3 Diffusion-based Approaches
1.4.4 Conclusions
Chapter 2 Generative Models for Basic Speech Enhancement
2.1 GANs for Speech Enhancement
2.2 HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement
2.2.1 Adapting HiFi-GAN Generator For Bandwidth Extension and Speech Enhancement
2.2.2 Light, Simple and Fast Discriminators Reduce Training Complexity
2.2.3 Unified Framework for Bandwidth Extension and Speech Enhancement
2.2.4 Experiments
2.2.5 Conclusion
2.3 FFC-SE: Fast Fourier Convolution for Speech Enhancement
2.3.1 Fast Fourier Convolution
2.3.2 FFC-AE
2.3.3 FFC-UNet
2.3.4 Training
2.3.5 Experiments and Results
2.3.6 Conclusion
2.4 FINALLY: Fast and Universal Speech Enhancement With Studio-like Quality
2.4.1 Perceptual Loss for Speech Generation
2.4.2 FINALLY
2.4.3 Results
2.4.4 Conclusion
Chapter 3 Iterative Autoregression for Streaming Speech Enhancement
3.1 Autoregressive Models for Waveform Generation
3.1.1 Background
3.1.2 Limitations of Teacher Forcing
3.2 Iterative Autoregression
3.2.1 Model Architecture
3.2.2 Experimental Series Description
3.3 Conclusion
Chapter 4 Unsupervised Speech Enhancement with Unconditional Diffusion Model
4.1 Background
4.1.1 Score-based Diffusion Models
4.1.2 Inverse Problems with Diffusion Models
4.2 UnDiff
4.2.1 Unconditional Speech Generation
4.2.2 Speech Inverse Tasks
4.3 Experiments and Discussion
4.3.1 Datasets
4.3.2 Metrics
4.3.3 Experimental Details
4.3.4 Unconditional Speech Generation
4.3.5 Inverse Tasks
4.4 Conclusion
Conclusions
Bibliography
List of Figures
List of Tables
Appendix: Russian Translation of the Ph.D. dissertation
Recommended list of dissertations in the specialty "Other specialties", VAK code 00.00.00
Development of efficient parametrizations for generative adversarial networks in image and speech generation tasks, 2024, Candidate of Sciences Aibek Alanov
Application of deep generative models to prediction problems in machine learning, 2024, Candidate of Sciences Dmitry Aleksandrovich Baranchuk
Geometrical Shape, Coalescence and Oxygen Distribution Modeling in Benign Tissue Spheroids, 2024, Candidate of Sciences Katherine Aleksandrovna Vilinski-Mazur
Algorithms for network applications and their theoretical analysis, 2022, Doctor of Sciences Sergey Igorevich Nikolenko
Introduction to the dissertation (part of the author's abstract) on the topic "Generative Models for Speech Enhancement"
Introduction
Topic of the Work
Speech recordings frequently suffer from background noise, reverberation, reduced frequency bandwidth, and other distortions that diminish both their intelligibility and the listener's aesthetic satisfaction. Speech enhancement techniques are designed to restore perceptually plausible and intelligible clean speech from such corrupted signals. Speech enhancement can lower the technical requirements for recording, enabling professionals to produce studio-quality recordings without sophisticated studio equipment. These models can facilitate communication in acoustically adverse environments and are particularly important for hearing assistance technologies for individuals with hearing impairments. Additionally, speech enhancement models play a critical role in creating high-quality datasets for deep learning systems, which rely on extensive collections of clean speech for effective training. As a result, even publicly available data, irrespective of the initial recording conditions, can be improved to meet the quality standards required by advanced speech synthesis models. Speech enhancement techniques are therefore crucial for a wide range of applications.
A more formal definition of the speech enhancement problem can be given in a probabilistic formulation. Let p(y) be the clean speech distribution, and let p(x|y) be the degradation model. The problem of speech enhancement is to retrieve a sample from the conditional distribution p(y|x), where x ~ p(x|y) is a speech signal corrupted by the degradation model. The degradation model p(x|y) can include various forms of signal transformation, such as background noise injection, reduction of frequency bandwidth, codec artifacts, reverberation, etc. (a toy example of such a degradation pipeline is sketched at the end of this subsection). Depending on the available resources and the particular application, one can distinguish several formulations of the speech enhancement problem:
• Basic Speech Enhancement [137, 82]: This setting involves supervised speech enhancement without latency constraints. In this formulation, the degradation model distribution is known in advance, and the model is not restricted to be streaming.
• Streaming Speech Enhancement [47, 22]: In this scenario, speech enhancement models must operate with a limited algorithmic delay, i.e., they must be causal. The streaming setting restricts the model to a limited window of future information, typically ranging from 3-8 ms (low latency) to 60 ms (high latency).
• Unsupervised Speech Enhancement [76, 92]: This formulation does not assume that the degradation model p(x|y) is known in advance. The model is adapted to the particular degradation only at the inference stage.
The development of efficient generative models for speech enhancement in the context of all of these formulations is the topic of this work.
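To make the degradation model p(x|y) concrete, the following is a minimal sketch of a toy degradation pipeline that injects background noise at a random SNR and then reduces the frequency bandwidth with a low-pass filter. The function name, parameter values, and filter settings are illustrative assumptions and do not reproduce the degradation models used later in this work.

```python
# Toy degradation model p(x|y): additive noise at a random SNR followed by
# bandwidth reduction. Names and constants are illustrative only.
import numpy as np
from scipy.signal import butter, lfilter

def degrade(clean: np.ndarray, noise: np.ndarray, sr: int = 16000,
            snr_db_range=(0.0, 20.0), cutoff_hz: float = 4000.0) -> np.ndarray:
    # 1) Background noise injection at a random SNR (noise assumed at least as long as clean).
    snr_db = np.random.uniform(*snr_db_range)
    noise = noise[: len(clean)]
    gain = np.sqrt(np.sum(clean ** 2) /
                   (np.sum(noise ** 2) * 10 ** (snr_db / 10) + 1e-12))
    noisy = clean + gain * noise
    # 2) Bandwidth reduction: low-pass filtering simulates a narrowband channel.
    b, a = butter(N=8, Wn=cutoff_hz / (sr / 2), btype="low")
    return lfilter(b, a, noisy)
```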
Relevance
Historically, the field of audio processing has been dominated by methods that rely on handcrafted heuristics and statistical models, often employing unrealistic assumptions about the structure of speech and disturbances [29, 28]. However, the rise of machine learning and, more specifically, deep learning, has marked a paradigm shift towards leveraging data-driven methodologies. In contrast to traditional approaches, the data-driven paradigm learns the characteristics of the signals directly from the data. This approach has been shown to be beneficial in many domains, including speech processing.
Initial attempts to apply deep learning methods to the speech enhancement problem treated it as a predictive problem [22, 37, 12, 47]. Following the principle of empirical risk minimization, the goal of predictive modeling is to find a model with minimal average error over the training data. Given a noisy waveform or spectrogram, these approaches predict the clean signal by minimizing a point-wise distance in the waveform domain, the spectral domain, or both.
However, given severe degradations applied to the signal, there is an inherent uncertainty in the restoration of the speech signal (i.e., given the degraded signal, the clean signal is not restored unambiguously), which often leads to oversmoothing of the predicted speech. From the probabilistic point of view, minimization of the point-wise distance leads to an averaging effect. For example, optimization of the mean squared error between waveforms delivers the expectation of the waveform over the conditional distribution of clean speech given its degraded version. The key issue is that the expectation over this distribution is not guaranteed to lie within this distribution.
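This averaging effect admits a short formal statement; the derivation below is standard and is included only to make the argument explicit.

```latex
f^{*} \;=\; \arg\min_{f}\; \mathbb{E}_{x,y}\,\lVert y - f(x)\rVert^{2}
\quad\Longrightarrow\quad
f^{*}(x) \;=\; \mathbb{E}\left[\, y \mid x \,\right].
```

For instance, if the missing high-frequency band of y given x equals +a or -a with equal probability, the MSE-optimal prediction is E[y | x] = 0, i.e., silence above the cutoff, even though silence itself is an unlikely sample under p(y | x).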
An illustrative example of this phenomenon and its impact on speech enhancement is shown in Figure 1. The model is trained to extend the frequency bandwidth of the speech signal given the signal with reduced bandwidth by minimizing the mean squared error distance between clean and generated waveforms [6]. Notably, the model is not able to restore high-frequency content while minimizing the MSE objective. Due to high uncertainty, the model over-smooths the high frequencies, being unable to restore speech content above 5 kHz. A similar effect occurs with other point-wise losses, including spectral-based losses.
Unlike predictive models, generative models aim at sampling from the clean speech distribution conditioned on the degraded signal rather than minimizing point-wise loss. The advantage of this approach is that the speech enhancement model is enforced to produce a signal lying within the clean speech distribution, as illustrated in Figure 1. The advantage of generative models over predictive models in application is confirmed by many recent works. This work studies generative models in the context of speech enhancement, proposes novel methods, and improves the efficiency and quality of existing solutions.
In the first part of this work, we focus on developing efficient solutions for basic speech enhancement [137, 82]. The basic speech enhancement formulation does not impose constraints on the available signal context, and the degradation model is assumed to be known during training. It is important to note that the speech enhancement problem does not require the model to learn the complete conditional distribution p(y|x).
When given a noisy speech sample, the goal is typically to obtain the most probable clean speech sample that retains the lexical content and voice of the noisy sample. This is unlike applications such as text-to-image generation, where the objective is to generate a variety of images for each text prompt due to the higher level of uncertainty and the need for diverse options to select the best image. In contrast, for speech enhancement, it is not necessary to capture the entire conditional distribution. Instead, the focus is on capturing the mode of this distribution, which might be a simpler task.
We show that the GAN framework is naturally suited for this formulation of the speech enhancement problem since it tends to retrieve the main mode of the distribution—precisely what speech enhancement should typically do. Therefore, in this work, we employ the GAN framework for the basic speech enhancement formulation and design efficient architectures of generator and discriminator neural networks.
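As a minimal illustration of the adversarial objective involved, the sketch below shows generic least-squares GAN (LS-GAN) losses for a conditional enhancement generator; it assumes a single discriminator and omits the auxiliary losses (e.g., feature matching) used in practice, so it should be read as a schematic rather than the exact training setup of this work.

```python
# Minimal LS-GAN objectives for conditional speech enhancement.
# G maps a degraded waveform x to an estimate of clean speech y; D scores waveforms.
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, x, y):
    score_real = D(y)                      # real clean speech
    score_fake = D(G(x).detach())          # generated speech, no gradient into G
    return (F.mse_loss(score_real, torch.ones_like(score_real))    # push D(y)    -> 1
            + F.mse_loss(score_fake, torch.zeros_like(score_fake)))  # push D(G(x)) -> 0

def generator_loss(D, G, x):
    score_fake = D(G(x))
    return F.mse_loss(score_fake, torch.ones_like(score_fake))     # push D(G(x)) -> 1
```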
In the second part of the work, we focus on streaming speech enhancement [47, 22, 127, 142], in particular, on low-latency speech enhancement [127, 142]. Streaming models are an essential component of real-time speech enhancement tools. The streaming regime constrains speech enhancement models to use only a tiny context of future information.
[Figure 1 spectrogram panels: Ground Truth; Input (MSE = 1e-4); Predicted, MSE minimization (MSE = 5e-5); Predicted, GAN training (MSE = 1e-3). Frequency axis up to 8000 Hz, magnitude scale from 0 dB down to -80 dB.]
Figure 1: Example of speech signal spectrograms. Mean squared error minimization leads to oversmoothing of the predicted signal, resulting in missing high-frequency content. While the prediction provided by the generative model delivers MSE even higher than the input, the predicted signal resembles the original speech content. A similar effect is reported for image super-resolution models [70].
As a result, the low-latency streaming setup is generally considered a challenging task and has a significant negative impact on the model's quality.
However, the sequential nature of streaming generation offers a natural possibility for autoregression, that is, utilizing previous predictions while making current ones. The conventional method for training autoregressive generative models is teacher forcing, but its primary drawback lies in the training-inference
mismatch that can lead to a substantial degradation in quality. We propose a straightforward yet effective alternative technique for training autoregressive low-latency speech enhancement models. We demonstrate that the proposed approach leads to stable improvement across diverse architectures and training scenarios.
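To make the mismatch concrete, the sketch below contrasts a teacher-forced training step, where the autoregressive context consists of ground-truth clean frames, with streaming inference, where the context necessarily consists of the model's own past outputs. The frame-level interface and tensor shapes are assumptions made for illustration; the proposed iterative autoregression technique addresses this gap by progressively replacing the ground-truth context with the model's own predictions during training.

```python
# Teacher forcing vs. streaming inference for a causal, frame-wise enhancer
# model(noisy_frame, past_enhanced_frames). Interfaces are illustrative only.
import torch

def teacher_forced_step(model, noisy_frames, clean_frames, t):
    # Training with teacher forcing: the autoregressive context is ground-truth clean speech.
    context = clean_frames[:, :t]                       # (batch, t, frame_dim)
    return model(noisy_frames[:, t], context)

@torch.no_grad()
def streaming_inference(model, noisy_frames):
    # Inference: the context consists of the model's own past outputs,
    # which the model never saw during teacher-forced training.
    outputs = [torch.zeros_like(noisy_frames[:, 0])]    # dummy initial context frame
    for t in range(noisy_frames.shape[1]):
        context = torch.stack(outputs, dim=1)
        outputs.append(model(noisy_frames[:, t], context))
    return torch.stack(outputs[1:], dim=1)
```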
Lastly, we focus on the unsupervised speech enhancement problem [76, 92]. We introduce a diffusion probabilistic model capable of solving various speech inverse tasks with unknown degradation models during training. Once trained for speech waveform generation in an unconditional manner, it can be adapted to different tasks including degradation inversion and neural vocoding.
In this subproblem, we first tackle the challenging problem of unconditional waveform generation by comparing different neural architectures and preconditioning domains. After that, we demonstrate how the trained unconditional diffusion model could be adapted to different tasks of speech processing by means of recent developments in post-training conditioning of diffusion models. Finally, we demonstrate the performance of the proposed technique on the tasks of bandwidth extension, declipping, and vocoding, and compare it to the baselines.
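One widely used family of post-training conditioning techniques is reconstruction guidance, as in diffusion posterior sampling [16], in which each reverse step of the unconditional model is corrected by the gradient of a data-consistency term computed through the degradation operator of the inverse task. The sketch below shows this generic scheme; the update rule, guidance weight, and sampler interface are simplified assumptions and are not claimed to reproduce UnDiff's exact procedure.

```python
# Generic reconstruction-guidance step for an unconditional diffusion model.
# score_model approximates the score of the clean-speech prior; degradation(y) is the
# (differentiable) operator of the inverse task, e.g. a low-pass filter for bandwidth extension.
import torch

def guided_reverse_step(score_model, reverse_step, degradation, y_t, t, x_obs, zeta=1.0):
    y_t = y_t.detach().requires_grad_(True)
    y_prev, y0_hat = reverse_step(score_model, y_t, t)   # unconditional update + clean estimate
    # Data-consistency term: the denoised estimate should explain the observation x_obs.
    residual = torch.linalg.vector_norm(degradation(y0_hat) - x_obs)
    grad = torch.autograd.grad(residual, y_t)[0]
    return (y_prev - zeta * grad).detach()               # guide the sample toward consistency
```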
Key Results and Conclusions
Contributions
The main contributions of this work can be summarized as follows:
1. We propose the HiFi++ composite generator architecture by combining the HiFi-GAN generator with three new modules: SpectralUnet, Wave-UNet, and SpectralMaskNet. This new generator architecture allows building a unified framework for bandwidth extension and speech enhancement, delivering state-of-the-art results in these tasks.
2. We design a novel architecture for direct spectrogram estimation based on the fast Fourier convolution operator. The architecture allows direct manipulation of cepstral features and further improves HiFi++ results on speech enhancement problems while being more parameter-efficient.
3. We investigate various feature extractors as backbones for speech perceptual loss and introduce criteria for selecting an extractor based on the structure of its feature space. The effectiveness of these criteria is validated by empirical results.
4. Based on these developments, we develop a novel universal speech enhancement model, FINALLY, which achieves state-of-the-art performance,
outperforming all existing solutions while being more computationally efficient.
5. We propose a novel method for training autoregressive models for low-latency streaming speech enhancement. The method mitigates the training-inference mismatch that arises when training with teacher forcing and yields a considerable improvement in streaming speech enhancement models with autoregressive conditioning.
6. We investigate a diffusion-based technique for unsupervised speech enhancement. The proposed unconditional diffusion model can be trained for the unconditional speech generation task and then be adapted for various speech restoration tasks without additional training.
Theoretical and Practical Significance
This work theoretically shows that adversarial training can be used for implicit regression towards the main mode of the distribution, making it a suitable tool for learning speech enhancement models. It also studies the structural properties of different speech feature extractor spaces and formulates a new perceptual loss.
Additionally, the work proposes new neural architectures, HiFi++ and FFC-SE, for deep generative models, improving the quality and computational efficiency of speech enhancement solutions. Based on these developments, the work proposes a highly efficient speech enhancement algorithm, FINALLY, which achieves state-of-the-art quality with significantly fewer computational resources than prior methods.
The work also outlines a novel method for training autoregressive models in situations with a high training-inference mismatch, significantly improving upon the conventional teacher forcing technique. The proposed iterative autoregression method is of considerable practical value given the widespread use of autoregressive models.
Furthermore, the work studies the problem of unsupervised speech enhancement and proposes a novel diffusion generative model, UnDiff, for unsupervised speech restoration. This work provides pioneering developments for the unsupervised speech enhancement problem.
Key Aspects/Ideas to be Defended
1. HiFi++ architecture for multi-domain signal processing in speech enhancement.
2. FFC-SE architecture for direct complex spectrogram estimation.
3. FINALLY model for universal speech enhancement.
4. Iterative autoregression technique for mitigation of training-inference mismatch within autoregressive models, studied in application to low-latency speech enhancement.
5. UnDiff diffusion probabilistic model for unsupervised speech enhancement.
Personal Contribution
The idea of the HiFi++ and FFC-SE generator architectures was proposed by the author of this work. The initial implementation of the HiFi++ architecture was done by the author, while Aibek Alanov and Oleg Ivanov helped to refine and prepare the codebase for experiments. The FFC-SE network was jointly developed with Ivan Shchekotov. The experiments for validating the networks' effectiveness were designed by the author. The implementation and paper writing were done jointly with Aibek Alanov, Oleg Ivanov, and Ivan Shchekotov. Dmitry Vetrov provided scientific guidance for this work.
The proof of mode-seeking LS-GAN behavior and formulation of the speech enhancement problem as a mode-finding problem were developed by the author of the thesis. The FINALLY model was developed together with Kirill Tamogashev and Nicholas Babaev. The author was responsible for scientific guidance, experiment planning, and code review.
The iterative autoregression technique was proposed and theoretically studied by the author of this work. The actual implementation and experimental validation were done jointly with Nicholas Babaev. The paper was written by the author with some assistance from Nicholas Babaev and Aibek Alanov.
The UnDiff generative model was designed and implemented jointly with Anastasia Yaschenko and Ivan Shchekotov. Dmitry Vetrov provided scientific guidance for this work.
Publications and Approbation of the Work
The results of this thesis are published in 3 first-tier publications and 1 second-tier publication. The PhD candidate is the main author of all of these articles¹.

¹ * indicates equal contribution
First-Tier Publications
• Andreev, P.*, Babaev, N.*, Saginbaev, A., Shchekotov, I., Alanov, A. (2023). Iterative autoregression: a novel trick to improve your low-latency speech enhancement model. Proc. INTERSPEECH 2023, 2448-2452, doi: 10.21437/Interspeech.2023-365 (Core A)
• Shchekotov, I.*, Andreev, P.*, Ivanov, O., Alanov, A., Vetrov, D.
(2022). FFC-SE: Fast Fourier Convolution for Speech Enhancement. Proc. INTERSPEECH 2022, 1188-1192, doi: 10.21437/Interspeech.2022-603 (Core A)
• Iashchenko, A.*, Andreev, P.*, Shchekotov, I.*, Babaev, N., Vetrov, D. (2023). UnDiff: Unsupervised Voice Restoration with Unconditional Diffusion Model. Proc. INTERSPEECH 2023, 4294-4298, doi: 10.21437/Interspeech.2023-367 (Core A)
Second-Tier Publications
• P. Andreev*, A. Alanov, O. Ivanov* and D. Vetrov, "HIFI++: A Unified Framework for Bandwidth Extension and Speech Enhancement," ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10097255. (Core B)
Reports at Scientific Conferences
• 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, June 8, 2023. Topic: "HIFI++: A Unified Framework for Bandwidth Extension and Speech Enhancement"
• 24th INTERSPEECH Conference, Dublin, Ireland, August 22, 2023. Topic: "Iterative autoregression: a novel trick to improve your low-latency speech enhancement model"
• 24th INTERSPEECH Conference, Dublin, Ireland, August 23, 2023. Topic: "UnDiff: Unsupervised Voice Restoration with Unconditional Diffusion Model"
• 23rd INTERSPEECH Conference, Incheon, Korea, September 20, 2022. Topic: "FFC-SE: Fast Fourier Convolution for Speech Enhancement"
Volume and Structure of the Work
The thesis contains an introduction chapter, which formulates the topic of this work, a background chapter, which introduces the necessary context, 3 content chapters, which describe the approaches developed for each of the introduced formulations, and a conclusion chapter, which summarizes the developments of this work and concludes the study. The full volume of the thesis is 101 pages.
Conclusion of the dissertation in the specialty "Other specialties", Andreev Pavel Konstantinovich
Conclusion
This work addresses the problem of speech enhancement in three formulations: basic speech enhancement, streaming speech enhancement, and unsupervised speech enhancement. For each formulation, theoretical and practical developments are presented that extend the capabilities of speech enhancement algorithms. The main conclusions drawn from the obtained results are the following:
1. Composite generator architectures with multi-domain processing provide a better trade-off between quality and complexity for bandwidth extension (BWE) and speech enhancement (SE) models. In particular, it is beneficial to augment audio models with modules that correct the signal in both the time and spectral domains, which leads to more efficient parameter usage and lower computational complexity, as demonstrated in the HiFi++ study.
2. Fast Fourier Convolution blocks are an effective architectural solution for building spectrum-processing modules. The global receptive field of this neural network layer enables accurate estimation of the signal phase while reducing the memory required for storing network weights (a simplified sketch of the spectral branch of such a block is given after this list).
3. Theoretical analysis shows that LS-GAN training can be used for implicit regression towards the mode of the distribution, which naturally aligns with the practical goals of the speech enhancement task. Practical implementations of GAN-based training confirm this analysis and demonstrate that GAN-based models can deliver fast, high-quality speech enhancement, outperforming other types of generative models at a lower computational cost.
4. Autoregressive conditioning can improve streaming speech enhancement models by exploiting information about past predictions during inference. However, applying the standard technique for training autoregressive models, teacher forcing, leads to a significant mismatch between training and inference, which in turn degrades enhancement quality. The developed iterative autoregression method offers a practical alternative to teacher forcing and enables efficient and effective training of autoregressive speech enhancement models.
5. Unsupervised speech enhancement poses a significant challenge because the degradation model is unknown during training. The problem can be addressed with an unconditional diffusion model that learns a prior distribution over speech signals and can be adapted to a particular degradation model at inference time. Unfortunately, our study reveals considerable difficulties with this approach, and the resulting models are substantially inferior to their supervised counterparts.
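For reference, the sketch below illustrates the core idea of the spectral branch of a Fast Fourier Convolution block: a real FFT over the feature map, a pointwise convolution applied to the stacked real and imaginary parts (which gives every output position a global receptive field), and an inverse FFT. It is a simplified, single-branch illustration under these assumptions and omits the local branch and cross-branch fusion of the full FFC operator.

```python
# Simplified spectral branch of a Fast Fourier Convolution block (Chi et al., 2020).
# A pointwise convolution in the Fourier domain gives a global receptive field
# along the transformed dimensions. Illustrative only; FFC-SE builds on this idea.
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Operates on stacked (real, imag) parts, hence 2x channels.
        self.conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1),
            nn.BatchNorm2d(channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")                   # complex, (B, C, H, W//2+1)
        feat = torch.cat([spec.real, spec.imag], dim=1)           # (B, 2C, H, W//2+1)
        feat = self.conv(feat)                                    # global mixing via 1x1 conv
        real, imag = torch.chunk(feat, 2, dim=1)
        spec = torch.complex(real, imag)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")  # back to (B, C, H, W)
```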
Directions for further research:
1. The feature matching loss proves to be an important heuristic for stabilizing GAN-based training. Although feature matching was originally proposed as a heuristic for training variational autoencoders (VAEs) with a perceptual loss, the theoretical properties of this loss remain poorly understood. At the same time, the theoretical analysis of GAN mode seeking does not account for the feature matching loss, which is an important omission. Investigating the feature matching loss is therefore an important direction for future research.
2. The proposed iterative autoregression algorithm was studied in the deterministic generation regime, meaning that for each input segment the output segment is computed deterministically. Such a procedure inevitably leads to a mismatch between the distributions of the real and generated waveforms, since the segment distribution is modeled by a single delta function. An important question for further research is the behavior of iterative autoregression under non-deterministic generation that models the entire distribution of the output segment.
3. The developed UnDiff diffusion model showed limited quality compared to supervised baselines. One of the main reasons for this is likely the lack of sufficient conditioning information, which makes the task harder for the diffusion model. Fully unconditional modeling of the speech waveform distribution may simply be too difficult, so the model fails to solve it adequately with the resources applied. A promising direction for simplifying this task is to provide some conditioning information without resorting to a classifier. Such information may relate to the linguistic content, the speaker's voice, or both. Additional conditioning inputs for the diffusion model could simplify the task and improve unsupervised speech enhancement performance.
4. Another direction deserving detailed analysis in the context of unsupervised speech enhancement is the generalization of universal speech enhancement models beyond the set of degradation models observed during training. Universal speech enhancement, an approach aimed at generalizing across a broad range of degradation models during supervised training, can be viewed as a more practical alternative to diffusion-based unsupervised speech enhancement.
Bibliography of the dissertation research, Candidate of Sciences Andreev Pavel Konstantinovich, 2024
Bibliography
[1] Pavel Andreev et al. "Hifi++: a unified framework for bandwidth extension and speech enhancement". In: arXiv preprint arXiv:2203.13086 (2022).
[2] Alexei Baevski et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations". In: arXiv preprint arXiv:2006.11477 (2020).
[3] Samy Bengio et al. "Scheduled sampling for sequence prediction with recurrent neural networks". In: Advances in Neural Information Processing Systems 28 (2015).
[4] Mikolaj Binkowski et al. "High fidelity speech synthesis with adversarial networks". In: arXiv preprint arXiv:1909.11646 (2019).
[5] Sawyer Birnbaum et al. "Temporal film: Capturing long-range sequence dependencies with feature-wise modulations". In: arXiv preprint arXiv:1909.06628 (2019).
[6] Zalan Borsos et al. "Audiolm: a language modeling approach to audio generation". In: arXiv preprint arXiv:2209.03143 (2022).
[7] Andrew Brock, Jeff Donahue and Karen Simonyan. "Large scale GAN training for high fidelity natural image synthesis". In: arXiv preprint arXiv:1809.11096 (2018).
[8] Tom Brown et al. "Language models are few-shot learners". In: Advances in Neural Information Processing Systems 33 (2020), pp. 1877-1901.
[9] Jaeuk Byun et al. "An Empirical Study on Speech Restoration Guided by Self-Supervised Speech Representation". In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1-5.
[10] Edresson Casanova et al. "Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone". In: International Conference on Machine Learning. PMLR. 2022, pp. 2709-2720.
[11] Jun Chen et al. "Fullsubnet+: Channel attention fullsubnet with complex spectrograms for speech enhancement". In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 7857-7861.
[12] Sanyuan Chen et al. "Wavlm: Large-scale self-supervised pre-training for full stack speech processing". In: IEEE Journal of Selected Topics in Signal Processing 16.6 (2022), pp. 1505-1518.
[13] Lu Chi, Borui Jiang and Yadong Mu. "Fast fourier convolution". In: Advances in Neural Information Processing Systems 33 (2020), pp. 4479-4488.
[14] Jooyoung Choi et al. "ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 14367-14376.
[15] Yu-An Chung et al. "W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training". In: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2021, pp. 244-250.
[16] Hyungjin Chung et al. "Diffusion posterior sampling for general noisy inverse problems". In: arXiv preprint arXiv:2209.14687 (2022).
[17] George Close, Thomas Hain and Stefan Goetze. "The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions". In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE. 2023, pp. 1-5.
[18] George Close et al. "Perceive and predict: self-supervised speech representation based loss functions for speech enhancement". In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1-5.
[19] Yann N Dauphin et al. "Language modeling with gated convolutional networks". In: International Conference on Machine Learning. PMLR. 2017, pp. 933-941.
[20] Alexandre Defossez, Gabriel Synnaeve and Yossi Adi. "Real Time Speech Enhancement in the Waveform Domain". In: Interspeech. 2020.
[21] Alexandre Defossez et al. "High Fidelity Neural Audio Compression". In: Transactions on Machine Learning Research (2023).
[22] Alexandre Defossez et al. "Music Source Separation in the Waveform Domain". In: arXiv preprint arXiv:1911.13254 (2019).
[23] Qingyun Dou, Joshua Efiong and Mark JF Gales. "Attention Forcing for Speech Synthesis." In: Interspeech. 2020, pp. 4014-4018.
[24] Harishchandra Dubey et al. "Icassp 2022 deep noise suppression challenge". In: arXiv preprint arXiv:2202.13288 (2022).
[25] Yariv Ephraim. "Statistical-model-based speech enhancement systems". In: Proceedings of the IEEE 80.10 (1992), pp. 1526-1555.
[26] Yariv Ephraim and David Malah. "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator". In: IEEE Transactions on Acoustics, Speech, and Signal Processing 32.6 (1984), pp. 1109-1121.
[27] Szu-Wei Fu et al. "Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement". In: International Conference on Machine Learning. PMLR. 2019, pp. 2031-2041.
[28] Szu-Wei Fu et al. "MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement". In: arXiv preprint arXiv:2104.03538 (2021).
[29] Karan Goel et al. "It's raw! audio generation with state-space models". In: International Conference on Machine Learning. PMLR. 2022, pp. 7616-7633.
[30] Augustine Gray and John Markel. "Distance measures for speech processing". In: IEEE Transactions on Acoustics, Speech, and Signal Processing 24.5 (1976), pp. 380-391.
[31] Daniel Griffin and Jae Lim. "Signal estimation from modified short-time Fourier transform". In: IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984), pp. 236-243.
[32] Xiang Hao et al. "FullSubNet: a full-band and sub-band fusion model for real-time single-channel speech enhancement". In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6633-6637.
[33] Jonathan Ho, Ajay Jain and Pieter Abbeel. "Denoising diffusion probabilistic models". In: Advances in Neural Information Processing Systems 33 (2020), pp. 6840-6851.
[34] Sepp Hochreiter and Jürgen Schmidhuber. "Long short-term memory". In: Neural Computation 9.8 (1997), pp. 1735-1780.
[35] Tsun-An Hsieh et al. Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement. 2021. arXiv: 2010.15174 [cs.SD].
[36] Tsun-An Hsieh et al. "Improving perceptual quality by phone-fortified perceptual loss using wasserstein distance for speech enhancement". In: arXiv preprint arXiv:2010.15174 (2020).
[37] Wei-Ning Hsu et al. "Hubert: Self-supervised speech representation learning by masked prediction of hidden units". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), pp. 3451-3460.
[38] Jie Hu, Li Shen and Gang Sun. "Squeeze-and-Excitation Networks". 2018.
[39] Kuo-Hsuan Hung et al. "Boosting self-supervised embeddings for speech enhancement". In: arXiv preprint arXiv:2204.03339 (2022).
[40] Jeff Hwang et al. TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch. 2023. arXiv: 2310.17864 [eess.AS].
[41] Umut Isik et al. "Poconet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss". In: arXiv preprint arXiv:2008.04470 (2020).
[42] Keith Ito and Linda Johnson. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/. 2017.
[43] Jesper Jensen and Cees H Taal. "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 24.11 (2016), pp. 2009-2022.
[44] Don H Johnson. "Signal-to-noise ratio". In: Scholarpedia 1.12 (2006), p. 2088.
[45] Nal Kalchbrenner et al. "Efficient neural audio synthesis". In: International Conference on Machine Learning. PMLR. 2018, pp. 2410-2419.
[46] Tero Karras et al. "Elucidating the Design Space of Diffusion-Based Generative Models". In: Advances in Neural Information Processing Systems.
[47] Eesung Kim and Hyeji Seo. "SE-Conformer: Time-Domain Speech Enhancement Using Conformer." In: Interspeech. 2021, pp. 2736-2740.
[48] Jaehyeon Kim, Jungil Kong and Juhee Son. "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech". In: International Conference on Machine Learning. PMLR. 2021, pp. 5530-5540.
[49] Jaeyoung Kim, Mostafa El-Khamy and Jungwon Lee. "End-to-end multi-task denoising for joint SDR and PESQ optimization". In: arXiv preprint arXiv:1901.09146 (2019).
[50] Yuma Koizumi et al. "Libritts-r: A restored multi-speaker text-to-speech corpus". In: arXiv preprint arXiv:2305.18802 (2023).
[51] Yuma Koizumi et al. "Miipher: A robust speech restoration model integrating self-supervised speech and text representations". In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE. 2023, pp. 1-5.
[52] Yuma Koizumi et al. "WaveFit: An iterative and non-autoregressive neural vocoder based on fixed-point iteration". In: 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE. 2023, pp. 884-891.
[53] Jungil Kong, Jaehyeon Kim and Jaekyoung Bae. "Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis". In: arXiv preprint arXiv:2010.05646 (2020).
[54] Qiuqiang Kong et al. "Decoupling magnitude and phase estimation with deep resunet for music source separation". In: arXiv preprint arXiv:2109.05418 (2021).
[55] Zhifeng Kong et al. "Diffwave: A versatile diffusion model for audio synthesis". In: arXiv preprint arXiv:2009.09761 (2020).
[56] Volodymyr Kuleshov, S Zayd Enam and Stefano Ermon. "Audio super resolution using neural networks". In: arXiv preprint arXiv:1708.00853 (2017).
[57] Kundan Kumar et al. "Melgan: Generative adversarial networks for conditional waveform synthesis". In: arXiv preprint arXiv:1910.06711 (2019).
[58] Alex M Lamb et al. "Professor forcing: A new algorithm for training recurrent networks". In: Advances in Neural Information Processing Systems 29 (2016).
[59] Anders Boesen Lindbo Larsen et al. "Autoencoding beyond pixels using a learned similarity metric". In: International Conference on Machine Learning. PMLR. 2016, pp. 1558-1566.
[60] Bunlong Lay et al. "Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement". In: arXiv preprint arXiv:2302.14748 (2023).
[61] Jonathan Le Roux et al. "SDR-half-baked or well done?" In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 626-630.
[62] Christian Ledig et al. "Photo-realistic single image super-resolution using a generative adversarial network". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4681-4690.
[63] Jean-Marie Lemercier et al. "Diffusion Models for Audio Restoration". In: arXiv preprint arXiv:2402.09821 (2024).
[64] Jean-Marie Lemercier et al. "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023).
[65] Cheuk Ting Li and Farzan Farnia. "Mode-seeking divergences: theory and applications to GANs". In: International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 8321-8350.
[66] Naihan Li et al. "Robutrans: A robust transformer-based text-to-speech model". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 05. 2020, pp. 8228-8235.
[67] Hsin-Yi Lin et al. "Unsupervised noise adaptive speech enhancement by discriminator-constrained optimal transport". In: Advances in Neural Information Processing Systems 34 (2021), pp. 19935-19946.
[68] Ju Lin et al. "A Two-stage Approach to Speech Bandwidth Extension". In: Proc. Interspeech 2021 (2021), pp. 1689-1693.
[69] Haohe Liu et al. "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models". In: arXiv preprint arXiv:2301.12503 (2023).
[70] Haohe Liu et al. "Voicefixer: A unified framework for high-fidelity speech restoration". In: arXiv preprint arXiv:2204.05841 (2022).
[71] Haohe Liu et al. "VoiceFixer: Toward General Speech Restoration with Neural Vocoder". In: arXiv preprint arXiv:2109.13731 (2021).
[72] Chen-Chou Lo et al. "MOSNet: Deep learning based objective assessment for voice conversion". In: arXiv preprint arXiv:1904.08352 (2019).
[73] Philipos C Loizou. Speech enhancement: theory and practice. CRC Press, 2007.
[74] Jaime Lorenzo-Trueba et al. "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods". In: arXiv preprint arXiv:1804.04262 (2018).
[75] Ilya Loshchilov and Frank Hutter. "Fixing Weight Decay Regularization in Adam". In: CoRR abs/1711.05101 (2017). arXiv: 1711.05101. URL: http://arxiv.org/abs/1711.05101.
[76] Yen-Ju Lu, Yu Tsao and Shinji Watanabe. "A study on speech enhancement based on diffusion probabilistic model". In: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE. 2021, pp. 659-666.
[77] Yen-Ju Lu et al. "Conditional diffusion probabilistic model for speech enhancement". In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 7402-7406.
[78] James MacQueen et al. "Some methods for classification and analysis of multivariate observations". In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. 14. Oakland, CA, USA. 1967, pp. 281-297.
[79] Pranay Manocha et al. "CDPAM: Contrastive learning for perceptual audio similarity". In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 196-200.
[80] Xudong Mao et al. "Least squares generative adversarial networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2794-2802.
[81] Eloi Moliner, Jaakko Lehtinen and Vesa Valimaki. "Solving Audio Inverse Problems with a Diffusion Model". In: arXiv preprint arXiv:2210.15228 (2022).
[82] Eloi Moliner, Jaakko Lehtinen and Vesa Valimaki. "Solving audio inverse problems with a diffusion model". In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1-5.
[83] Max Morrison et al. "Chunked autoregressive gan for conditional waveform synthesis". In: International Conference on Learning Representations (2021).
[84] Gautham J Mysore. "Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges". In: IEEE Signal Processing Letters 22.8 (2014), pp. 1006-1010.
[85] Arsha Nagrani, Joon Son Chung and Andrew Zisserman. "Voxceleb: a large-scale speaker identification dataset". In: arXiv preprint arXiv:1706.08612 (2017).
[86] Alexander Quinn Nichol and Prafulla Dhariwal. "Improved denoising diffusion probabilistic models". In: International Conference on Machine Learning. PMLR. 2021, pp. 8162-8171.
[87] Henri J Nussbaumer. "The fast Fourier transform". In: Fast Fourier Transform and Convolution Algorithms. Springer, 1981, pp. 80-111.
[88] Aaron van den Oord et al. "WaveNet: A Generative Model for Raw Audio". In: 9th ISCA Speech Synthesis Workshop, p. 125.
[89] Tom Le Paine et al. "Fast wavenet generation algorithm". In: arXiv preprint arXiv:1611.09482 (2016).
[90] Ryan Prenger, Rafael Valle and Bryan Catanzaro. "Waveglow: A flow-based generative network for speech synthesis". In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 3617-3621.
[91] William M Rand. "Objective criteria for the evaluation of clustering methods". In: Journal of the American Statistical Association 66.336 (1971), pp. 846-850.
[92] Chandan KA Reddy, Vishak Gopal and Ross Cutler. "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors". In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 886-890.
[93] Dario Rethage, Jordi Pons and Xavier Serra. "A wavenet for speech denoising". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2018, pp. 5069-5073.
[94] Julius Richter et al. "Speech Enhancement and Dereverberation with Diffusion-based Generative Models". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023), pp. 2351-2364. doi: 10.1109/TASLP.2023.3285241.
[95] Julius Richter et al. "Speech enhancement and dereverberation with diffusion-based generative models". In: arXiv preprint arXiv:2208.05830 (2022).
[96] Antony W Rix et al. "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs". In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221). Vol. 2. IEEE. 2001, pp. 749-752.
[97] Olaf Ronneberger, Philipp Fischer and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation". In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. 2015, pp. 234-241.
[98] Simon Rouard and Gaetan Hadjeres. "CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis". In: Music Information Retrieval Conf. (ISMIR). 2021, pp. 579-585.
[99] Takaaki Saeki et al. "Utmos: Utokyo-sarulab system for voicemos challenge 2022". In: arXiv preprint arXiv:2204.02152 (2022).
[100] Hiroshi Sato et al. "Downstream task agnostic speech enhancement with self-supervised representation loss". In: arXiv preprint arXiv:2305.14723 (2023).
[101] Robin Scheibler et al. "Diffusion-based Generative Speech Source Separation". In: arXiv preprint arXiv:2210.17327 (2022).
[102] Joan Serra et al. "Universal speech enhancement with score-based diffusion". In: arXiv preprint arXiv:2206.03065 (2022).
[103] Ivan Shchekotov et al. "FFC-SE: Fast Fourier Convolution for Speech Enhancement". In: Proc. Interspeech 2022. 2022, pp. 1188-1192. doi: 10.21437/Interspeech.2022-603.
[104] Jonathan Shen et al. "Non-attentive tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling". In: arXiv preprint arXiv:2010.04301 (2020).
[105] Yang Song et al. "Score-Based Generative Modeling through Stochastic Differential Equations". In: International Conference on Learning Representations.
[106] Daniel Stoller, Sebastian Ewert and Simon Dixon. "Wave-u-net: A multi-scale neural network for end-to-end audio source separation". In: arXiv preprint arXiv:1806.03185 (2018).
[107] Jiaqi Su, Zeyu Jin and Adam Finkelstein. "HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features". In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE. 2021, pp. 166-170.
[108] Jiaqi Su, Zeyu Jin and Adam Finkelstein. "HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks". In: arXiv preprint arXiv:2006.05694 (2020).
[109] Roman Suvorov et al. "Resolution-robust Large Mask Inpainting with Fourier Convolutions". In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022, pp. 2149-2159.
[110] Cees H Taal et al. "An algorithm for intelligibility prediction of time-frequency weighted noisy speech". In: IEEE Transactions on Audio, Speech, and Language Processing 19.7 (2011), pp. 2125-2136.
[111] Marco Tagliasacchi et al. "SEANet: A multi-modal speech enhancement network". In: arXiv preprint arXiv:2009.02095 (2020).
[112] Qiao Tian et al. "TFGAN: Time and frequency domain based generative adversarial network for high-fidelity speech synthesis". In: arXiv preprint arXiv:2011.12206 (2020).
[113] Zehai Tu et al. "A Two-Stage End-to-End System for Speech-in-Noise Hearing Aid Processing". In: Proc. Clarity (2021), pp. 3-5.
[114] Cassia Valentini-Botinhao et al. "Noisy speech database for training speech enhancement algorithms and tts models". 2017.
[115] Ravichander Vipperla et al. "Bunched LPCNet: Vocoder for Low-Cost Neural Text-To-Speech System". In: Proc. Interspeech 2020 (2020), pp. 3565-3569.
[116] Yongqi Wang and Zhou Zhao. "Fastlts: Non-autoregressive end-to-end unconstrained lip-to-speech synthesis". In: Proceedings of the 30th ACM International Conference on Multimedia. 2022, pp. 5678-5687.
[117] Ziqian Wang et al. "SELM: Speech Enhancement using Discrete Tokens and Language Models". In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 11561-11565.
[118] Simon Welker, Julius Richter and Timo Gerkmann. "Speech enhancement with score-based generative models in the complex STFT domain". In: arXiv preprint arXiv:2203.17004 (2022).
[119] Ronald J Williams and David Zipser. "A learning algorithm for continually running fully recurrent neural networks". In: Neural Computation 1.2 (1989), pp. 270-280.
[120] WV-MOS: MOS score prediction by fine-tuned wav2vec2.0 model. https://github.com/AndreevP/wvmos. Accessed: 2022-01-20.
[121] Yong Xu et al. "A regression approach to speech enhancement based on deep neural networks". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.1 (2014), pp. 7-19.
[122] Junichi Yamagishi, Christophe Veaux, Kirsten MacDonald et al. "Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92)". 2019.
[123] Pavel Zaviska and Pavel Rajmic. "Analysis Social Sparsity Audio Declipper". In: arXiv preprint arXiv:2205.10215 (2022).
[124] Pavel Zaviska et al. "A proper version of synthesis-based sparse audio declipper". In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 591-595.
[125] Richard Zhang et al. "The unreasonable effectiveness of deep features as a perceptual metric". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 586-595.
[126] K Zmolikova and JH Cernock. "BUT System for the First Clarity Enhancement Challenge". In: Proceedings of Clarity (2021), pp. 1-3.