CLUNet: Using Contrastive Learning to Improve Traditional DAE Voice Isolation Methods

Marcus Bluestone, Isaac (Zack) Duitz, Benjamin Grey

Final project for 6.7960, MIT
As online interactions continue to expand across personal, professional, and social spaces, the need for reliable techniques to extract clean voice signals from noisy input streams has become increasingly pressing. In the past few decades, deep learning-based approaches have gained particular traction, yielding ever more sophisticated and successful methods for isolating and enhancing human vocal signals in complex acoustic environments.
More specifically, given an audio signal that contains some background noise and/or disturbances, can we remove the noise without affecting the quality of the underlying signal? For instance, consider the example below: can a model remove random Gaussian noise? Our research develops a novel approach to this problem. We publish our code at: https://github.com/MarcusBluestone/Voice_Isolation
Thesis: We propose and test a new model architecture, CLUNet, that combines contrastive loss-based methods with denoising autoencoder-based methods to generate higher-quality reconstructions. We show that by combining the encodings learned via a traditional denoising autoencoder with those learned via contrastive loss, the training procedure is more stable and the reconstructed signals achieve a lower error than they would without the contrastive encodings.
Let’s unpack what this means.
But what types of data are we even dealing with here?
Humans and computers handle sound very differently.
What humans perceive as sound is simply a physical vibration through the air – the pushing and pulling of air molecules. See figure below. When we hear a sound, we do not actually hear the frequency or amplitude directly, but rather perceive certain qualities of the sound, such as loudness, pitch, and timbre. And yet, the task of voice isolation is generally quite easy for humans to perform. [4]
Computers, on the other hand, require sound to be discretized in some fashion,
and so the continuous pressure waveforms are converted into numbers by sampling
the air pressure at very small time intervals. The frequency at which we sample
is called the sample rate.
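As a small illustration of sampling (torchaudio and the file name below are placeholders chosen for this sketch, not part of the project's pipeline):

```python
# Minimal illustration of a sampled (digitized) audio signal.
# "speech_clip.wav" is a placeholder file name, not from the project.
import torchaudio

waveform, sample_rate = torchaudio.load("speech_clip.wav")  # shape: (channels, num_samples)
print(sample_rate)          # e.g. 16000 samples per second
print(waveform.shape[-1])   # a 3-second clip at 16 kHz would hold 48,000 samples
```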
The sampled data is called a digitized waveform. Historically, waveforms were rarely used directly for voice-isolation or speech-processing tasks. Their raw form is highly detailed, noisy, and difficult to interpret using traditional signal-processing or statistical models. Only within the past decade—enabled by advances in deep learning for time-series data—have researchers begun to successfully perform voice isolation directly from waveform inputs, using architectures capable of learning the necessary features automatically [1], [2], [3].
Beginning in the late 20th century, it became standard practice to transform speech waveforms into the time–frequency domain prior to analysis [5]. This was typically done using the short-time Fourier transform (STFT), which produces a complex-valued spectrogram. The spectrogram may then be decomposed into its magnitude and phase components. In a spectrogram representation, the x-axis denotes time, the y-axis denotes frequency, and each point in the 2D plane encodes the magnitude or phase of the signal at that time–frequency location [6]. Despite the emergence of end-to-end deep learning models, spectrogram-based representations remain widely used today due to their stability, interpretability, and computational efficiency (note that the log-magnitude is often used for more stable training) [7], [8].
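Concretely, the decomposition described above can be written as follows (our notation: \(w\) is the analysis window, \(H\) the hop length, \(N\) the FFT size, and \(\epsilon\) a small constant added for numerical stability):

\[
X(t, f) = \sum_{n} x[n]\, w[n - tH]\, e^{-2\pi i f n / N} = |X(t, f)|\, e^{i\phi(t, f)}, \qquad S_{\log}(t, f) = \log\bigl(|X(t, f)| + \epsilon\bigr),
\]

where \(|X(t, f)|\) is the magnitude spectrogram, \(\phi(t, f)\) is the phase, and \(S_{\log}\) is the log-magnitude often used for training.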
Depending on how the data is encoded, a variety of state-of-the-art methods exist for voice isolation:
Our novel method, CLUNet, seeks to combine the highly successful results from convolutional denoising architectures and from contrastive learning-based approaches. We avoid working with time-series data directly in order to sidestep large computational overheads and, instead, propose new directions in the Future Works section at the end.
To motivate why combining Contrastive Learning (CL) and Denoising Autoencoders (DAE)
might produce better results, it’s important to understand the difference between the
two approaches:
Contrastive learning (CL): an unsupervised method which trains
an encoder to map different views of the same signal to nearby locations in
latent space, while mapping views of different signals to locations that are
far apart in latent space. In our application, the “different views” correspond
to different, random noise augmentations applied to the same clean signal. Note
that this method only trains an encoder and doesn’t deal with reconstructing
the clean signal.
Denoising Autoencoder (DAE): a self-supervised method which trains both an encoder and a decoder to recreate the original clean signal from a noisy, augmented form of the signal.
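As a minimal sketch of the DAE objective just described (the layers and the Gaussian noise level here are placeholders, not CLUNet's actual architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoder/decoder operating on (batch, 1, 256, 256) spectrograms.
encoder = nn.Sequential(nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU())
decoder = nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1)

def dae_loss(clean_spec: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    """MSE between the reconstruction of a noise-augmented spectrogram and the clean target."""
    noisy = clean_spec + noise_std * torch.randn_like(clean_spec)  # random augmentation
    recon = decoder(encoder(noisy))                                # encode, then decode
    return F.mse_loss(recon, clean_spec)
```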
Despite seeming very different, the two approaches share a subtle connection: both require understanding the underlying clean structure of the noisy inputs. In order for an encoder trained via contrastive learning to map noisy views of different clean signals to locations far apart in the latent space, it must, in some way, uncover the underlying clean signal structure. We hypothesize that this latent-space information should therefore also be useful in the reconstruction task required by the DAE.
Training this pipeline is not a trivial task, and we propose other options
in the Future Works section. In this project, we first train \(Enc_{CL}\) solely
via contrastive loss. Afterwards, we freeze that encoder, and train the
\(Enc_{REC}\) - \(Dec_{REC}\) pair using traditional MSE loss with the clean signal as target.
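A minimal sketch of this two-phase setup, under our own assumptions: the layer shapes below are placeholders rather than CLUNet's actual architecture, and we assume the two latent representations are concatenated along the channel dimension before decoding (consistent with the bottleneck convolution discussed later in the results).

```python
import itertools
import torch
import torch.nn as nn

# Placeholder modules standing in for the project's encoders and decoder.
enc_cl = nn.Conv2d(1, 8, 3, stride=2, padding=1)              # contrastive encoder (trained in Phase 1)
enc_rec = nn.Conv2d(1, 8, 3, stride=2, padding=1)             # reconstruction encoder (trained in Phase 2)
dec_rec = nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1)   # decoder over the fused latents

# Phase 2: freeze Enc_CL so only Enc_REC and Dec_REC receive gradients.
enc_cl.requires_grad_(False).eval()
optimizer = torch.optim.Adam(
    itertools.chain(enc_rec.parameters(), dec_rec.parameters()), lr=1e-3
)

def phase2_step(noisy_spec: torch.Tensor, clean_spec: torch.Tensor) -> torch.Tensor:
    """One MSE training step of the Enc_REC / Dec_REC pair with Enc_CL frozen."""
    with torch.no_grad():
        z_cl = enc_cl(noisy_spec)                        # frozen contrastive features
    z_rec = enc_rec(noisy_spec)                          # trainable reconstruction features
    recon = dec_rec(torch.cat([z_rec, z_cl], dim=1))     # fuse the latents at the bottleneck
    loss = nn.functional.mse_loss(recon, clean_spec)     # clean signal as the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```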
More specifically, we have two phases for our training:
Phase 1 - CL: Sample two clean signals from our dataset; randomly augment each of them \(n\) times; apply the STFT to get their amplitude spectrograms; pass these through \(Enc_{CL}\); then maximize the cosine similarity between encodings from the same clean signal and minimize the cosine similarity between encodings from different signals. This is done using the InfoNCE loss objective.
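A sketch of this objective, under our assumptions that each view's encoding has been flattened to a vector and that the temperature is a free choice (it is not specified above):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z: torch.Tensor, signal_ids: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over cosine similarities.

    z:          (N, D) encodings of N augmented views, one row per view.
    signal_ids: (N,) index of the clean signal each view was generated from.
    """
    z = F.normalize(z, dim=1)               # unit vectors, so dot products are cosine similarities
    sim = z @ z.t() / temperature           # (N, N) pairwise similarity matrix
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))               # a view is never its own positive
    pos_mask = (signal_ids[:, None] == signal_ids[None, :]) & ~self_mask

    # Log-softmax over all other views; positives are pushed up, negatives pushed down.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    return -(pos_log_prob.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)).mean()
```

For example, with two clean signals each augmented \(n = 4\) times, `signal_ids` would be `torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])`.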
For our dataset of clean signals, we used the LibriSpeech dataset provided through PyTorch's torchaudio library [17]. For training, we used the "train-clean-100" partition (~7 GB), which contains around 100 hours of clean speech from 251 different speakers. For the validation/test set, we used the "dev-clean" partition (~0.7 GB), which contains about 5 hours of clean speech. We randomly split the data into ~3-second intervals. The data was saved as waveforms with a sample rate of 16 kHz.
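A sketch of how such segments could be produced with torchaudio's LibriSpeech loader (the download path is a placeholder, and we chop non-overlapping chunks here for simplicity rather than splitting randomly):

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000
SEGMENT_LEN = 3 * SAMPLE_RATE   # ~3-second chunks

train_set = torchaudio.datasets.LIBRISPEECH("data/", url="train-clean-100", download=True)
dev_set = torchaudio.datasets.LIBRISPEECH("data/", url="dev-clean", download=True)

def split_into_segments(waveform: torch.Tensor) -> list[torch.Tensor]:
    """Chop a (1, T) waveform into ~3-second clips, dropping the remainder."""
    n_segments = waveform.shape[-1] // SEGMENT_LEN
    return [waveform[:, i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN] for i in range(n_segments)]

waveform, sr, *_ = train_set[0]   # each item: (waveform, sample_rate, transcript, ids...)
assert sr == SAMPLE_RATE
clips = split_into_segments(waveform)
```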
To convert the waveforms into spectrograms, we applied the STFT with a Hann window of size 510 and a hop length of 256 [6]. These parameters were chosen so that applying the inverse STFT to an uncorrupted, clean spectrogram perfectly recovers the original waveform, while still producing reasonably sized spectrograms (256 × 198). We padded the spectrograms to size (256, 256) before feeding them into the network.
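A sketch of this transform with the stated parameters; the use of the magnitude alone and the zero-padding placement are our assumptions:

```python
import torch
import torch.nn.functional as F

N_FFT = 510        # Hann window of size 510 -> 510 // 2 + 1 = 256 frequency bins
HOP_LENGTH = 256

def to_model_input(waveform: torch.Tensor) -> torch.Tensor:
    """Magnitude spectrogram of a (1, T) waveform, zero-padded along time to 256 frames."""
    window = torch.hann_window(N_FFT)
    spec = torch.stft(waveform, n_fft=N_FFT, hop_length=HOP_LENGTH,
                      window=window, return_complex=True)   # (1, 256, n_frames)
    mag = spec.abs()                                         # keep the amplitude spectrogram
    return F.pad(mag, (0, 256 - mag.shape[-1]))              # pad the time axis to 256
```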
Our results demonstrate more consistent training and a lower final loss compared to traditional denoising architectures. This validates our hypothesis that including the latent-space information from the contrastive loss-based encoder improves the accuracy of the model when performing reconstruction. Moreover, the training procedure is much more stable when we first train with the contrastive loss.
To further verify that contrastive learning is useful in the learning process, we inspect the mean of the absolute values of the weights (MAW) in the convolutional layer at the bottleneck, just before decoding. We compare the weights associated with the \(Enc_{REC}\) output to those associated with the \(Enc_{CL}\) output (a sketch of this computation appears after the table below). The mean weights are comparable across all noise augmentations, which indicates that the model uses information from both encoders in the decoding process.
| Noise used (G = Gaussian, E = environmental) | MAW from \(Enc_{REC}\) | MAW from \(Enc_{CL}\) |
|---|---|---|
| G-01 | 0.03 | 0.02 |
| G-1 | 0.04 | 0.02 |
| G-3 | 0.03 | 0.02 |
| G-5 | 0.03 | 0.04 |
| E-10 | 0.03 | 0.02 |
| E-50 | 0.03 | 0.02 |
| E-70 | 0.03 | 0.02 |
| E-100 | 0.02 | 0.02 |
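A sketch of how the MAW values could be computed, assuming the bottleneck is a single convolution whose input channels are the concatenation of the \(Enc_{REC}\) and \(Enc_{CL}\) outputs (the channel counts below are placeholders):

```python
import torch.nn as nn

def mean_abs_weights(bottleneck: nn.Conv2d, n_rec_channels: int) -> tuple[float, float]:
    """Mean |weight| over the input channels attributed to each encoder.

    Conv2d weights have shape (out_channels, in_channels, kH, kW); we assume the
    first `n_rec_channels` input channels carry the Enc_REC output and the rest
    carry the Enc_CL output.
    """
    w = bottleneck.weight.detach().abs()
    maw_rec = w[:, :n_rec_channels].mean().item()
    maw_cl = w[:, n_rec_channels:].mean().item()
    return maw_rec, maw_cl

# Illustrative bottleneck with 8 + 8 concatenated input channels.
bottleneck = nn.Conv2d(16, 16, kernel_size=3, padding=1)
print(mean_abs_weights(bottleneck, n_rec_channels=8))
```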
This research tackles the problem of voice isolation, i.e., extracting a clean audio signal from a noisy environment. We tested the novel thesis that combining the latent-space representations from a contrastive-learning framework and from a traditional reconstruction framework allows for a more stable training procedure and more accurate final models. We demonstrate this experimentally by showing the effectiveness of our methodology on a variety of noise augmentations, both Gaussian and environmental, over a range of amplitudes.
However, there are still many future directions we hope to explore: