Using the standard deviation and not directly the variance is crucial.
Indeed, if we take $S(x) = \operatorname{Var}(x)$ in the hinge function, the gradient of $S$ with respect to $x$ becomes close to $0$ when $x$ is close to its mean $\bar{x}$. In this case, the gradient of $v$ also becomes close to $0$ and the embeddings collapse.
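As a concrete illustration, here is a minimal PyTorch sketch of this variance term; the function name is our own, $\gamma = 1$ is the target standard deviation, and $\epsilon$ is the small constant inside the square root.

```python
import torch

def vicreg_variance_term(z: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4) -> torch.Tensor:
    """Hinge loss keeping the standard deviation of every embedding
    dimension above gamma. z has shape (n, d)."""
    # Regularized standard deviation S(z, eps) = sqrt(Var(z) + eps), per dimension.
    std = torch.sqrt(z.var(dim=0) + eps)
    # Hinge: penalize dimensions whose standard deviation falls below gamma.
    return torch.mean(torch.relu(gamma - std))
```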
We define the covariance matrix of $Z$ as:
$$C(Z) = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^{T}, \quad \text{where} \quad \bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i.$$
Inspired by Barlow Twins [49], we can then define the covariance regularization term $c$ as the sum of the squared off-diagonal coefficients of $C(Z)$, scaled by a factor $1/d$:
$$c(Z) = \frac{1}{d} \sum_{i \neq j} \left[ C(Z) \right]_{i,j}^{2}.$$
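Following the same pattern, a sketch of this covariance term (again, the function name is our own):

```python
import torch

def vicreg_covariance_term(z: torch.Tensor) -> torch.Tensor:
    """Sum of the squared off-diagonal entries of C(Z), scaled by 1/d.
    z has shape (n, d)."""
    n, d = z.shape
    z = z - z.mean(dim=0)                          # center each dimension
    cov = (z.T @ z) / (n - 1)                      # covariance matrix C(Z), shape (d, d)
    off_diag = cov - torch.diag(torch.diag(cov))   # zero out the diagonal
    return off_diag.pow(2).sum() / d
```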
We finally define the invariance criterion $s$ between $Z$ and $Z'$ as the mean-squared Euclidean distance between each pair of vectors, without any normalization:
$$s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \left\| z_i - z'_i \right\|_2^2.$$
The overall loss function is a weighted average of the invariance, variance and covariance terms:
$$\ell(Z, Z') = \lambda\, s(Z, Z') + \mu\, [v(Z) + v(Z')] + \nu\, [c(Z) + c(Z')],$$
where $\lambda$, $\mu$ and $\nu$ are hyper-parameters controlling the importance of each term. The overall objective function taken on all images over an unlabelled dataset $\mathcal{D}$ is given by:
$$\mathcal{L} = \sum_{I \in \mathcal{D}} \; \sum_{t, t' \sim \mathcal{T}} \ell(Z^I, Z'^I),$$
where $Z^I$ and $Z'^I$ are the embeddings of the two views of image $I$ produced with transformations $t$ and $t'$ sampled from a distribution $\mathcal{T}$.
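Putting the pieces together, a sketch of the per-batch loss $\ell$, using the two helper functions sketched above; `lam`, `mu` and `nu` stand for $\lambda$, $\mu$ and $\nu$. Note that `F.mse_loss` averages over all $n \cdot d$ elements rather than over $n$, so it realizes $s$ up to a constant factor.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                lam: float = 25.0, mu: float = 25.0, nu: float = 1.0) -> torch.Tensor:
    """Per-batch VICReg loss: weighted sum of the invariance,
    variance and covariance terms."""
    sim = F.mse_loss(z_a, z_b)                                       # invariance term s
    var = vicreg_variance_term(z_a) + vicreg_variance_term(z_b)      # variance term v, per branch
    cov = vicreg_covariance_term(z_a) + vicreg_covariance_term(z_b)  # covariance term c, per branch
    return lam * sim + mu * var + nu * cov
```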
In this section, we evaluate the representations obtained after self-supervised pretraining of a ResNet-50 backbone with VICReg for 1000 epochs on the training set of ImageNet, using the training protocol described in Section 4. We also pretrain and evaluate on the ESC-50 audio classification dataset.
We evaluate the representations obtained with the ResNet-50 backbone pretrained with VICReg by training a linear classifier on top of the frozen representations learned by the backbone, then measuring the accuracy of this classifier on the downstream task. For audio, we evaluate on ESC-50, an environmental sound classification dataset with 50 classes.
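A minimal sketch of this linear evaluation protocol follows; the optimizer and its hyper-parameters here are illustrative placeholders, not the settings used in the paper.

```python
import torch
from torch import nn

def linear_evaluation(backbone: nn.Module, loader, feat_dim: int,
                      num_classes: int, epochs: int = 30) -> nn.Module:
    """Train a linear classifier on top of frozen representations."""
    backbone.eval()
    for p in backbone.parameters():      # freeze the pretrained backbone
        p.requires_grad = False
    classifier = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(classifier.parameters(), lr=0.02, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = backbone(x)      # frozen representations
            loss = nn.functional.cross_entropy(classifier(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```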
We jointly embed a raw audio time series on one branch and its corresponding time-frequency representation on the other branch. The raw waveform is processed by a 1-dimensional ResNet-18, while the mel spectrogram computed from the same waveform is processed by a standard ResNet-18.
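A rough sketch of this two-branch setup is shown below; `Wave1DEncoder` is a hypothetical stand-in for the 1-dimensional ResNet-18, and the mel-spectrogram parameters and embedding dimension are illustrative.

```python
import torch
import torchaudio
from torchvision.models import resnet18

EMBED_DIM = 2048  # illustrative embedding dimension

class Wave1DEncoder(torch.nn.Module):
    """Stand-in for the 1D ResNet-18 over the raw waveform;
    the real branch is a full 1D ResNet-18, sketched here as a small CNN."""
    def __init__(self, out_dim: int = EMBED_DIM):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv1d(1, 64, kernel_size=80, stride=4),
            torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool1d(1),
            torch.nn.Flatten(),
            torch.nn.Linear(64, out_dim),
        )

    def forward(self, x):  # x: (batch, 1, time)
        return self.net(x)

wave_encoder = Wave1DEncoder()

# Time-frequency branch: standard 2D ResNet-18 on the mel spectrogram,
# with the first convolution adapted to a single input channel.
spec_encoder = resnet18(num_classes=EMBED_DIM)
spec_encoder.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

mel = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_mels=64)

def embed_pair(waveform):  # waveform: (batch, 1, time)
    z_wave = wave_encoder(waveform)       # raw time-series branch
    z_spec = spec_encoder(mel(waveform))  # time-frequency branch
    return z_wave, z_spec                 # fed to the VICReg loss sketched above
```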
Table 3 compares the performance of a linear classifier trained on the frozen representations obtained with VICReg and Barlow Twins against a simple supervised baseline in which a ResNet-18 is trained directly on the time-frequency representation.
The current best approaches on this task report higher accuracy, but rely on tricks such as heavy data augmentation or pretraining on larger audio and video datasets.
With this experiment, our purpose is not to push the state of the art on ESC-50, but merely to demonstrate the applicability of VICReg to settings with multiple architectures and input modalities.
In this section we study how the different components of our method contribute to its performance, as well as how they interact with components from other self-supervised methods.
All reported results are obtained with the linear evaluation protocol, using a ResNet-50 backbone unless otherwise mentioned, and 100 epochs of pretraining, which gives results consistent with those obtained with 1000 epochs of pretraining.
VICReg has several unique properties:
The ability of VICReg to function with different parameters, architectures, and input modalities for the two branches widens the applicability of joint-embedding SSL to many applications, including multi-modal signals.
VICReg uses the same decorrelation mechanism as Barlow Twins.
The variance term of VICReg allows us to get rid of the standardization of the embeddings. An undesirable phenomenon occurs in Barlow Twins: the embeddings before standardization can shrink and become constant up to numerical precision, which can cause numerical instabilities; this is solved by adding a small constant scalar to the denominator of the standardization.
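For concreteness, a sketch of such a stabilized standardization (the value of `eps` is illustrative):

```python
def standardize(z, eps: float = 1e-5):
    """Per-dimension standardization of a batch of embeddings z of shape (n, d);
    eps in the denominator prevents a division by a near-zero standard
    deviation when the embeddings shrink toward a constant."""
    return (z - z.mean(dim=0)) / (z.std(dim=0) + eps)
```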
The whitening operation of W-MSE consists in computing the inverse covariance matrix of the projections and using its square root as a whitening operator on the projections.
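One way to realize this operation, sketched here with an eigendecomposition; the exact factorization used in W-MSE may differ, and `eps` is a hypothetical ridge term for numerical stability.

```python
import torch

def whiten(z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Whiten projections z of shape (n, d) so that their covariance
    becomes (approximately) the identity."""
    n, d = z.shape
    z = z - z.mean(dim=0)
    cov = (z.T @ z) / (n - 1) + eps * torch.eye(d)  # ridge term for stability
    # Inverse square root of the covariance via an eigendecomposition.
    eigvals, eigvecs = torch.linalg.eigh(cov)
    w = eigvecs @ torch.diag(eigvals.rsqrt()) @ eigvecs.T
    return z @ w
```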
Contrastive and clustering-based self-supervised algorithms rely on direct comparisons between elements of negative pairs.
VICReg is a simple approach to self-supervised image representation learning with three regularizations: a variance term that keeps the standard deviation of each embedding dimension above a threshold, an invariance term that makes the embeddings of two views of the same image close to each other, and a covariance term that decorrelates the dimensions of the embeddings.
VICReg achieves results on par with the state of the art on many downstream tasks, pushing forward the boundaries of non-contrastive self-supervised learning.
Limitations: The time and memory costs of VICReg are dominated by the computation of the covariance matrix for each processed batch, which is quadratic in the dimension of the embeddings.