CLUNet: Using Contrastive Learning to Improve Traditional DAE Voice Isolation Methods
Marcus Bluestone Isaac (Zack) Duitz Benjamin Grey
Final project for 6.7960, MIT
Outline

Introduction

Background: Formalization of Voice Isolation

Background: What is Sound?

Methodology: CLUNet

Experiments & Results

Conclusion & Future Work

References

Introduction: The Problem of Voice Isolation

As online interactions continue to expand across personal, professional, and social spaces, the need for reliable techniques to extract clean voice signals from noisy input streams has become increasingly pressing. In the past few decades, Deep Learning-based approaches have gained particular traction, yielding ever more sophisticated and successful methods for isolating and enhancing human vocal signals in complex acoustic environments.

More specifically, given an audio signal that contains background noise and/or disturbances, can we remove the noise without affecting the quality of the underlying signal? For instance, consider the example below: can a model remove random Gaussian noise? Our research develops a novel approach to this problem. We publish our code at: https://github.com/MarcusBluestone/Voice_Isolation

Example of Noisy Environment. From left \(\rightarrow\) right, we have the clean signal, Gaussian noise w/ \(\sigma\) = .01, and Gaussian noise w/ \(\sigma\) = .05

Thesis: We propose and test a new model architecture, CLUNet, which combines Contrastive Loss-based methods with Denoising Autoencoder-based methods to generate higher-quality reconstructions. We show that by combining the encodings learned via a traditional denoising autoencoder with those learned via contrastive loss, the training procedure becomes more stable and the reconstructed signals achieve a lower error.

Let’s unpack what this means.

Background: Formalization of Voice Isolation

One can view the problem of cleaning a noisy audio signal as a specific instance of denoising, where the goal is to recover an underlying clean representation from a corrupted input. More formally, we sample a clean input \(x\) from our dataset of clean audio signals and corrupt it with a randomized noise function \(g\), producing \(g(x)\). We want to find a function \(f\) that minimizes the expected reconstruction loss \(\mathcal{L}\) between the original clean signal and the output of \(f\):
\[ \displaystyle \arg\min_{f(\cdot)} \; \mathbb{E}_{x \sim D_{\text{clean}}}[\mathcal{L}(x, f(g(x)))] \]
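In practice, this expectation is approximated by an empirical average over a batch of clean signals. Below is a minimal PyTorch sketch of that estimate, assuming a generic denoiser `f`, a noise function `g`, and a batch tensor `x` (all hypothetical names), with mean-squared error standing in for \(\mathcal{L}\).

```python
import torch
import torch.nn.functional as F

def denoising_objective(f, g, x):
    """Monte-Carlo estimate of E_x[ L(x, f(g(x))) ] over one batch of clean signals x."""
    noisy = g(x)                  # corrupt the clean batch with the randomized noise function
    recon = f(noisy)              # attempt to recover the clean signals
    return F.mse_loss(recon, x)   # L is taken to be mean-squared error here
```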

But what types of data are we even dealing with here?

Background: What is Sound?

Humans and computers handle sound very differently.


Physical Sound

What humans perceive as sound is simply a physical vibration through the air – the pushing and pulling of air molecules. See figure below. When we hear a sound, we do not actually hear the frequency or amplitude directly, but rather perceive certain qualities of the sound, such as loudness, pitch, and timbre. And yet, the task of voice isolation is generally quite easy for humans to perform. [4]

Example of an audio clip. Press to listen.

Digital Waveforms

Computers, on the other hand, require sound to be discretized in some fashion, and so the continuous pressure waveforms are converted into numbers by sampling the air pressure at very small time intervals. The frequency at which we sample is called the sample rate.

The sampled data is called a digitized waveform. Historically, waveforms were rarely used directly for voice-isolation or speech-processing tasks. Their raw form is highly detailed, noisy, and difficult to interpret using traditional signal-processing or statistical models. Only within the past decade—enabled by advances in deep learning for time-series data—have researchers begun to successfully perform voice isolation directly from waveform inputs, using architectures capable of learning the necessary features automatically. [1],[2],[3]

Waveform visualization
Digitized version of the audio signal above. We use a sample rate of 16 kHz.
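For concreteness, the snippet below shows one way to load a clip and resample it to 16 kHz with torchaudio; the file path is a placeholder.

```python
import torchaudio
import torchaudio.functional as AF

waveform, sr = torchaudio.load("example.flac")   # waveform shape: (channels, num_samples)
if sr != 16_000:
    waveform = AF.resample(waveform, orig_freq=sr, new_freq=16_000)
# At a 16 kHz sample rate, one second of audio is represented by 16,000 numbers.
```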

Spectrograms

Beginning in the late 20th century, it became standard practice to transform speech waveforms into the time–frequency domain prior to analysis [5]. This was typically done using the short-time Fourier Transform (STFT), which produces a complex-valued spectrogram. The spectrogram may then be decomposed into its magnitude and phase components. In a spectrogram representation, the x-axis denotes time, the y-axis denotes frequency, and each point in the 2D plane encodes the magnitude or phase of the signal at that time–frequency location [6]. Despite the emergence of end-to-end deep learning models, spectrogram-based representations remain widely used today due to their stability, interpretability, and computational efficiency (note that often the log-magnitude is used for more stabilized training) [7],[8].

Spectrogram visualization
Amplitude and phase spectrograms of the audio sample above. See the Experiments section for the exact parameters used in the STFT.
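As a rough sketch of this decomposition, the snippet below computes a complex spectrogram with `torch.stft` and splits it into magnitude and phase; the default window and hop length shown here are the values we later use in our experiments.

```python
import torch

def to_spectrograms(waveform, n_fft=510, hop_length=256):
    """Decompose a (batch of) waveform(s) into magnitude and phase spectrograms."""
    window = torch.hann_window(n_fft)
    complex_spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                              window=window, return_complex=True)
    magnitude = complex_spec.abs()           # what is usually plotted as "the spectrogram"
    phase = complex_spec.angle()             # phase component, typically much noisier-looking
    log_magnitude = torch.log1p(magnitude)   # log-magnitude is often used for stabler training
    return magnitude, log_magnitude, phase
```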

Past Work: A Variety of Architectures

Depending on how the data is encoded, a variety of state-of-the-art methods exist for voice isolation:

  1. Denoising Convolutional Autoencoders: In this framework, a clean sample is augmented with noise and then fed through a traditional autoencoder, with the original clean signal used as the reconstruction target. When the audio signals are represented as spectrograms, standard convolutional autoencoder architectures have proven highly effective at reconstruction [3],[9],[10],[11],[12]. See below for an example architecture, the CAUNet from Li et al., which is a traditional bottlenecked autoencoder with an attention layer at the bottleneck [12].
    CAUNet
    CAUNet architecture. Picture taken from Li et al. [12].
  2. Time Series Models: Some approaches use a time-based architecture (such as Dual-Path RNNs [13], LSTMs [14], or Transformers [1]) that acts on the waveforms directly, without spectrogram intermediaries. These models utilize both local and global time information to predict the noise at each time step. They have achieved high success rates, but suffer from higher computational overhead due to the large size of the input waveforms and more complex model architectures.
  3. Contrastive Learning for Speech Processing: Contrastive learning can enforce useful similarities and dissimilarities in embeddings to improve voice isolation performance under different realistic conditions. For example, Noise-Aware Speech Separation (NASS) uses a patch-wise contrastive learning (PCL) objective to explicitly minimize mutual information between noisy background representations and speaker embeddings [15]. In a fully unsupervised setting, frame-level contrastive learning has also been used: Ochieng treats different frames from the same speaker as “augmentations” and pulls them together in representation space, then clusters them via deep modularization, which helps separate overlapping voices without needing permutation labels [16].

Our novel method, CLUNet, seeks to combine the highly successful results from convolutional denoising architectures and contrastive learning-based approaches. We avoid working with time-series data directly because of its large computational overhead, and instead propose new directions in the Future Work section at the end.

Methodology: CLUNet

CL vs. DAE

To motivate why combining Contrastive Learning (CL) and Denoising Autoencoders (DAE) might produce better results, it’s important to understand the difference between the two approaches:

Contrastive Learning (CL): an unsupervised method that trains an encoder to map different views of the same signal to nearby locations in latent space, while mapping views of different signals to locations that are far apart. In our application, the “different views” correspond to different random noise augmentations applied to the same clean signal. Note that this method only trains an encoder and does not deal with reconstructing the clean signal.

Denoising Autoencoder (DAE): a self-supervised method that trains both an encoder and a decoder to recreate the original clean signal from a noisy, augmented version of that signal.

The Idea Behind CLUNet

Despite seeming very different, the two approaches share a subtle connection: both require understanding the underlying clean structure of the noisy inputs. For an encoder trained via Contrastive Learning to map noisy views of different clean signals to locations far apart in the latent space, it must, in some way, uncover the underlying clean signal structure. We hypothesize that this latent-space information should therefore also be useful in the reconstruction task required by the DAE.

CLUNet
CLUNet architecture. We use two encoders -- one trained via CL and one paired with a decoder that is trained via traditional reconstruction loss.
However, because the encoder used in Contrastive Learning is never trained on a reconstruction task, we train two separate encoders. The full pipeline maps a noisy input through each encoder, concatenates the two latent-space representations, and feeds the result into a decoder that reconstructs the original clean signal. See the figure above.
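To make the pipeline concrete, here is a minimal sketch of the forward pass, assuming the two encoders and the decoder are given as PyTorch modules and that the latents are concatenated along the channel dimension (the actual convolutional blocks we use are described in the Experiments section).

```python
import torch
import torch.nn as nn

class CLUNet(nn.Module):
    def __init__(self, enc_cl: nn.Module, enc_rec: nn.Module, dec_rec: nn.Module):
        super().__init__()
        self.enc_cl = enc_cl      # encoder trained with contrastive loss (frozen in Phase 2)
        self.enc_rec = enc_rec    # reconstruction encoder
        self.dec_rec = dec_rec    # reconstruction decoder

    def forward(self, noisy_spec: torch.Tensor) -> torch.Tensor:
        z_cl = self.enc_cl(noisy_spec)        # contrastive latent
        z_rec = self.enc_rec(noisy_spec)      # reconstruction latent
        z = torch.cat([z_rec, z_cl], dim=1)   # concatenate along the channel dimension
        return self.dec_rec(z)                # predicted clean (amplitude) spectrogram
```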

Training CLUNet

Training this pipeline is not a trivial task, and we propose other options in the Future Work section. In this project, we first train \(Enc_{CL}\) solely via contrastive loss. Afterwards, we freeze that encoder and train the \(Enc_{REC}\)-\(Dec_{REC}\) pair using traditional MSE loss with the clean signal as the target.
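A minimal sketch of this two-phase schedule, using tiny stand-in modules in place of the real encoders and decoder: Phase 1 optimizes only \(Enc_{CL}\), then Phase 2 freezes it and optimizes the \(Enc_{REC}\)-\(Dec_{REC}\) pair.

```python
import torch
import torch.nn as nn

enc_cl = nn.Conv2d(1, 8, 3, padding=1)    # stand-in for Enc_CL
enc_rec = nn.Conv2d(1, 8, 3, padding=1)   # stand-in for Enc_REC
dec_rec = nn.Conv2d(16, 1, 3, padding=1)  # stand-in for Dec_REC (reads the concatenated latents)

# Phase 1: train Enc_CL alone with the contrastive (InfoNCE) objective.
opt_phase1 = torch.optim.Adam(enc_cl.parameters(), lr=1e-4)

# Phase 2: freeze Enc_CL, then train Enc_REC + Dec_REC with MSE against the clean target.
for p in enc_cl.parameters():
    p.requires_grad = False
opt_phase2 = torch.optim.Adam(list(enc_rec.parameters()) + list(dec_rec.parameters()), lr=1e-4)
```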

More specifically, we have two phases for our training:

Phase 1 - CL: Sample two clean signals from our dataset; randomly augment each of them \(n\) times; apply the STFT to get their amplitude spectrograms; pass them through \(Enc_{CL}\); then maximize the cosine similarity between encodings from the same clean signal and minimize the cosine similarity between encodings from different signals. This is implemented via the InfoNCE loss objective.

CLUNet
Training procedure for \(Enc_{CL}\), which is the encoder trained with Contrastive Loss.
The InfoNCE loss takes the form below, where \(z_i\) is the encoding of a noisy view, \(z_i^+\) is the encoding of another noisy augmentation of the same underlying clean signal as \(z_i\), and the \(z_j^-\) are encodings of noisy augmentations of a different underlying clean signal:
\[ \mathcal{L}_{\text{InfoNCE}} = - \frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\text{sim}(z_i, z_i^+)/\tau\big)} {\exp\big(\text{sim}(z_i, z_i^+)/\tau\big) + \sum_{j=1}^{M} \exp\big(\text{sim}(z_i, z_j^-)/\tau\big)} \]

Phase 2 - REC: Sample one clean signal from our dataset; randomly augment it; apply the STFT to obtain amplitude spectrograms for both the clean and noisy signals; pass the noisy spectrogram through \(Enc_{REC}\) and \(Enc_{CL}\); concatenate their latent representations and pass the result through \(Dec_{REC}\); minimize the MSE between the original clean spectrogram and the reconstruction output of the decoder.
CLUNet
Training procedure for \(Enc_{REC}\) and \(Dec_{REC}\), where they are trained concurrently using MSE between the reconstruction and the clean signal as target.
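For concreteness, the sketch below implements the InfoNCE term above for a single anchor encoding, with cosine similarity playing the role of \(\text{sim}(\cdot,\cdot)\); `anchor`, `positives`, and `negatives` are hypothetical tensors of shape (D,), (P, D), and (M, D). The Phase 2 objective is simply the MSE between the predicted and clean spectrograms.

```python
import torch
import torch.nn.functional as F

def info_nce_single(anchor, positives, negatives, tau=0.1):
    """InfoNCE term for one anchor: each positive is contrasted against all negatives."""
    pos_sim = F.cosine_similarity(anchor.unsqueeze(0), positives) / tau   # shape (P,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), negatives) / tau   # shape (M,)
    denom = torch.logaddexp(pos_sim, torch.logsumexp(neg_sim, dim=0))     # log(e^pos + sum_j e^neg_j)
    return -(pos_sim - denom).mean()

# Example usage with random 128-dimensional encodings:
loss = info_nce_single(torch.randn(128), torch.randn(4, 128), torch.randn(8, 128))
```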

Note that we only train our model to predict the original amplitude, not the phase, as is standard practice in these settings. The human ear is largely insensitive to the difference between clean and noisy phase, and phase spectrograms tend to look far noisier and are thus harder for the model to predict. When recreating the waveform for auditory inspection, we simply reuse the noisy phase spectrogram [11].
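A sketch of this resynthesis step, assuming a predicted clean magnitude spectrogram and the phase of the noisy input (both unpadded, using the STFT settings given in the Experiments section):

```python
import torch

def resynthesize(pred_magnitude, noisy_phase, n_fft=510, hop_length=256):
    """Invert the STFT using the predicted magnitude and the noisy phase."""
    complex_spec = torch.polar(pred_magnitude, noisy_phase)   # magnitude * exp(i * phase)
    window = torch.hann_window(n_fft)
    return torch.istft(complex_spec, n_fft=n_fft, hop_length=hop_length, window=window)
```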

Experiments & Results

Dataset

For our dataset of clean signals, we used the LibriSpeech dataset, accessed through torchaudio [17]. For training, we used the “train-clean-100” partition (~7 GB), which contains around 100 hours of clean speech from 251 different speakers. For the validation/test set, we used the “dev-clean” partition (~0.7 GB), which contains 5 hours of clean speech. We randomly split the data into ~3-second intervals. The data was saved as waveforms with a sample rate of 16 kHz.

To convert the waveforms into spectrograms, we applied the STFT with a Hann window of size 510 and a hop length of 256 [6]. These parameters were chosen so that applying the inverse STFT to an uncorrupted, clean spectrogram recovers the original waveform exactly, while still producing spectrograms of a reasonable size (256 x 198). We padded the spectrograms to size (256, 256) before feeding them into the network.
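Putting those parameters together, a minimal sketch of the transform and padding step (the dummy clip length is illustrative; the exact number of time frames depends on the clip length):

```python
import torch
import torch.nn.functional as F

waveform = torch.randn(1, 16_000 * 3)               # ~3 s of 16 kHz audio (random stand-in)
window = torch.hann_window(510)
spec = torch.stft(waveform, n_fft=510, hop_length=256,
                  window=window, return_complex=True)
magnitude = spec.abs()                               # shape (1, 256, T); T depends on clip length
magnitude = F.pad(magnitude, (0, 256 - magnitude.shape[-1]))   # pad the time axis up to 256
print(magnitude.shape)                               # torch.Size([1, 256, 256])
```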

Noise Augmentations

  1. Gaussian Noise: We add Gaussian noise centered at 0, with standard deviation ranging from 0.01 to 0.5.
  2. Environmental Noise: We used a Kaggle dataset that contains long clips of environmental noise, such as a “river”, “hallway”, or “office meeting”. We scaled the amplitude of these clips by factors ranging from 10 to 100. Note that some clips contain human dialogue, not just environmental noise, and we explicitly test with those clips as noise to see if the model is still able to differentiate between them and the original audio. A sketch of both augmentations is shown after this list.
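As referenced above, here is a minimal sketch of the two augmentation families as waveform-level operations; `noise_clip` stands in for a clip loaded from the Kaggle dataset, and the default strengths are just examples from the ranges given above.

```python
import torch

def add_gaussian_noise(waveform, sigma=0.1):
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    return waveform + sigma * torch.randn_like(waveform)

def add_environment_noise(waveform, noise_clip, scale=40.0):
    """Mix in an environmental clip, scaled in amplitude and cropped to the speech length."""
    noise = noise_clip[..., : waveform.shape[-1]]
    return waveform + scale * noise
```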
CLUNet
Example application of the STFT to a clean and a noisy (Gaussian noise, \(\sigma = 0.1\)) waveform.
Here's what adding the noise sounds like:
Example of Noisy Environment. Left is clean signal; middle is the noise (Gaussian w/ \(\sigma\) = .1); and right is the addition of the noise to the clean signal.
Example of Noisy Environment. Left is clean signal; middle is the noise (Environment w/ type=MEETING; scale = 40); and right is the addition of the noise to the clean signal.

Training & Experimental Parameters

We use the simple convolutional U-Net model described in [12] (without the attention layer at the bottleneck).

We train our model using PyTorch; learning rate = 1e-4; optimizer = Adam; epochs = 20; batch_size = 128. We run experiments comparing a traditional DAE architecture against our novel architecture that combines reconstruction loss with contrastive loss. We run each experiment for a variety of noise augmentation types (Gaussian with \(\sigma = 0.01, 0.1, 0.3, 0.5\) and Environmental with scale = 10, 50, 70, 100 and type = "MEETING"). For each experiment, we tracked both the train loss and validation loss per epoch. Total runtime for all of the experiments was approximately 15 hours.

Results

We first show the results for training just the contrastive learning encoder. See Figure below. The intensity of the noise increases from left to right.
CLUNet
Learning curves from the Contrastive Loss module. The first row is environmental noise, and the second row is Gaussian noise. From left to right, the strength of the noise increases. Plots show train/val loss vs. epoch number.
There are a few noteworthy results:
  1. The encoder is clearly learning and converges to a stable value by the end of training.
  2. As we increase the amplitude of the noise (in both the environmental and Gaussian settings), the model does worse. This makes intuitive sense, as it becomes harder to uncover the original signal.
  3. Learning in the Gaussian setting proved quite stable and near-instantaneous (with the final loss often near 0), while learning in the environmental setting was noisier, with fluctuations that eventually converged to a stable value.
Next, we show a comparison of traditional denoising architectures vs our novel approach.
CLUNet
Comparison of learning curves for reconstructing clean signals after applying Gaussian noise, with and without the contrastive loss module. Plots show train/val loss vs. epoch number. The first row ("Regular") does not use the second encoder, while the second row does. From left to right, the strength of the noise increases.
CLUNet
Comparison of learning curves for reconstructing clean signals after applying Environmental noise, with and without the contrastive loss module. Plots show train/val loss vs. epoch number. The first row ("Regular") does not use the second encoder, while the second row does. From left to right, the strength of the noise increases.

Our results demonstrate more consistent training and a lower final loss compared to traditional denoising architectures. This supports our hypothesis that including the latent-space information from the contrastive loss-based encoder improves the accuracy of the model when performing reconstruction. Moreover, the training procedure is much more stable when we first train with the contrastive loss.

To further test whether contrastive learning is useful in the learning process, we inspect the mean of the absolute values of the weights (MAW) in the convolutional layer at the bottleneck, just before decoding. We compare the weights associated with the \(Enc_{REC}\) output and the \(Enc_{CL}\) output. The mean weights are comparable across all noise augmentations, which indicates the model is using information from both encoders in the decoding process (a sketch of this computation appears after the table below).

Noise Used MAW from \(Enc_{REC}\) MAW from \(Enc_{CL}\)
G-0.01 0.03 0.02
G-0.1 0.04 0.02
G-0.3 0.03 0.02
G-0.5 0.03 0.04
E-10 0.03 0.02
E-50 0.03 0.02
E-70 0.03 0.02
E-100 0.02 0.02
The Mean Absolute Weight (MAW) of the convolutional layer at the bottleneck. The first column shows which noise augmentation the model was trained with: G-x denotes Gaussian noise with \(\sigma = x\), and E-y denotes environmental noise with scale y. The second column is the MAW of the weights associated with the \(Enc_{REC}\) output, and the third column is the MAW of the weights associated with the \(Enc_{CL}\) output.
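For reference, the sketch below shows one way such a statistic can be computed, assuming the bottleneck is a `Conv2d` whose input channels are the concatenation of the \(Enc_{REC}\) and \(Enc_{CL}\) latents; the channel counts are placeholders.

```python
import torch.nn as nn

c_rec, c_cl = 256, 256                       # latent channel counts (placeholders)
bottleneck = nn.Conv2d(c_rec + c_cl, 256, kernel_size=3, padding=1)

w = bottleneck.weight.detach().abs()         # shape: (out_channels, c_rec + c_cl, 3, 3)
maw_rec = w[:, :c_rec].mean().item()         # weights reading the Enc_REC part of the latent
maw_cl = w[:, c_rec:].mean().item()          # weights reading the Enc_CL part of the latent
print(f"MAW(Enc_REC) = {maw_rec:.3f}, MAW(Enc_CL) = {maw_cl:.3f}")
```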



We also include some example audio reconstructions:

Clean signal
Augmented w/ Gaussian Noise (\(\sigma = 5\)). WATCH YOUR EARS!!!
Reconstructed w/ regular DAE architecture.
Reconstructed w/ novel CLUNet architecture.

Conclusion & Future Work

This research tackles the problem of voice isolation, i.e., extracting a clean audio signal from a noisy environment. We tested the novel thesis that combining the latent-space representations from a contrastive-learning framework and a traditional reconstruction framework allows for a more stable training procedure and more accurate final models. We support this experimentally by demonstrating the effectiveness of our methodology on a variety of noise augmentations, Gaussian and Environmental, over a variety of amplitudes.

However, there are still many future directions we hope to explore:


References:



[1] K. Wang, B. He, and W.-P. Zhu, TSTNN: Two-stage Transformer based Neural Network for Speech Enhancement in the Time Domain, arXiv preprint arXiv:2103.09963, 2021. https://arxiv.org/pdf/2103.09963

[2] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise‑robust ASR, in 12th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Liberec, Czech Republic, August 25–28, 2015, Springer, Lecture Notes in Computer Science, vol. 9237, pp. 91–99. https://inria.hal.science/hal-01163493.

[3] C. Macartney and T. Weyde, Improved Speech Enhancement with the Wave‑U‑Net, arXiv preprint arXiv:1811.11307, 2018. https://arxiv.org/abs/1811.11307

[4] Sound, Wikipedia, https://en.wikipedia.org/wiki/Sound

[5] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978.

[6] Short‑time Fourier transform, Wikipedia, https://en.wikipedia.org/wiki/Short-time_Fourier_transform

[7] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, Single‑Channel Multi‑Speaker Separation using Deep Clustering, arXiv preprint arXiv:1607.02173, 2016. https://arxiv.org/abs/1607.02173

[8] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, A Regression Approach to Speech Enhancement Based on Deep Neural Networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015. http://staff.ustc.edu.cn/~jundu/Publications/publications/Trans2015_Xu.pdf

[9] Towards Data Science, Speech Enhancement with Deep Learning, 2020. https://towardsdatascience.com/speech-enhancement-with-deep-learning-36a1991d3d8d/

[10] M. Xu, J. Chen, and Y. Wang, Self-Supervised Speech Denoising Using Only Noisy Audio Signals, arXiv preprint arXiv:2111.00242, 2021. https://arxiv.org/abs/2111.00242

[11] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks, arXiv preprint arXiv:1709.03658, 2017. https://arxiv.org/pdf/1709.03658

[12] X. Li, Y. Huang, and Z. Zhang, Context-Aware U-Net for Speech Enhancement in Time Domain, IEEE Transactions on Multimedia, 2021. https://ieeexplore.ieee.org/document/9401787

[13] Y. Luo, Z. Chen, and T. Yoshioka, Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation, 2020. https://arxiv.org/abs/1910.06379

[14] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR, in 12th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Liberec, Czech Republic, August 2015. https://inria.hal.science/hal-01163493/document

[15] Z. Zhang, C. Chen, H.-H. Chen, X. Liu, Y. Hu, and E. S. Chng, Noise-Aware Speech Separation with Contrastive Learning, 2024. https://arxiv.org/abs/2305.10761

[16] P. Ochieng, Speech Separation Based on Contrastive Learning and Deep Modularization, 2023. https://arxiv.org/abs/2305.10652

[17] torchaudio documentation, LIBRISPEECH dataset. https://docs.pytorch.org/audio/main/generated/torchaudio.datasets.LIBRISPEECH.html