Exploring Simple Siamese Representation Learning

CVPR 2021


Presenter: Hao-Ting Li

Date: 2021-09-24

Background: Siamese Network

Applications

  • signature verification [4]
  • face verification [34]
  • tracking [3]
  • one-shot learning [23]

Problem

Trivial solution: all outputs "collapsing" to a constant

Strategies for Preventing Siamese Networks from Collapsing

  • Contrastive learning
    • SimCLR: repulses different images (negative pairs) while attracting the same image's two views (positive pairs).
      The negative pairs preclude constant outputs from the solution space.
  • Clustering
    • SwAV: incorporates online clustering into Siamese networks.
  • BYOL: relies only on positive pairs, using a momentum encoder to avoid collapse

SimSiam: Overview

SimSiam: Loss

  • p_1 ≜ h(f(x_1)), where f is the encoder and h is the prediction MLP
  • z_2 ≜ f(x_2)

Negative cosine similarity:

\mathcal{D}(p_1, z_2) = -\frac{p_1}{\|p_1\|_2} \cdot \frac{z_2}{\|z_2\|_2}

Following [15], we define a symmetrized loss as:

\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, z_2) + \frac{1}{2}\mathcal{D}(p_2, z_1)
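A minimal PyTorch sketch of this loss, assuming p1, p2, z1, z2 are the n-by-d outputs of the two views (names are illustrative, not the paper's released code):

```python
# Minimal sketch of D and the symmetrized loss (PyTorch).
# Assumes p1, p2, z1, z2 are n-by-d tensors computed as defined above;
# F.cosine_similarity performs the l2-normalization internally.
import torch.nn.functional as F

def D(p, z):
    # negative cosine similarity, averaged over the minibatch
    return -F.cosine_similarity(p, z, dim=1).mean()

loss = D(p1, z2) / 2 + D(p2, z1) / 2  # symmetrized loss
```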

SimSiam: Stop-gradient

Symmetrized loss:

\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, z_2) + \frac{1}{2}\mathcal{D}(p_2, z_1)

Applied stop-gradient (stopgrad):

\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \text{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \text{stopgrad}(z_1))
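In code, the stop-gradient amounts to detaching the target branch inside D; a sketch continuing the snippet above:

```python
def D(p, z):
    z = z.detach()  # stop-gradient: no gradient flows into the target branch
    return -F.cosine_similarity(p, z, dim=1).mean()

loss = D(p1, z2) / 2 + D(p2, z1) / 2
```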

SimSiam: Algorithm
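This slide corresponds to the paper's PyTorch-style pseudocode (Algorithm 1). A sketch in that spirit, where loader, aug, and update are placeholders rather than real library calls:

```python
# f: backbone + projection MLP; h: prediction MLP; D as defined above.
for x in loader:                          # load a minibatch x with n samples
    x1, x2 = aug(x), aug(x)               # two random augmentations
    z1, z2 = f(x1), f(x2)                 # projections, n-by-d
    p1, p2 = h(z1), h(z2)                 # predictions, n-by-d
    L = D(p1, z2) / 2 + D(p2, z1) / 2     # symmetrized loss (stop-gradient inside D)
    L.backward()                          # back-propagate
    update(f, h)                          # SGD update of f and h
```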

Baseline Settings

Optimizer

  • SGD
  • batch size: 512
  • learning rate: linear scaling [15] (lr × BatchSize / 256), with base lr = 0.05 (see the sketch after this list)
  • cosine decay schedule
  • weight decay: 0.0001
  • momentum: 0.9
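A PyTorch sketch of these settings; model and epochs are placeholders for the SimSiam network and the pre-training length:

```python
import torch

batch_size = 512
lr = 0.05 * batch_size / 256                       # linear lr scaling, base lr 0.05
optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```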

Baseline Settings

Projection MLP (part of the encoder f): 3-layer MLP

  • (fc + BN + ReLU) × 2 + (fc + BN): the output fc has BN but no ReLU
  • hidden layer: 2048-d

Prediction MLP (predictor h): 2-layer MLP (both MLPs are sketched below)

  • (fc + BN + ReLU) + fc: the output fc has no BN and no ReLU
  • dimensions:
    • input and output (z and p): 2048-d
    • hidden layer of h: 512-d (bottleneck structure)
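A PyTorch sketch of the two MLPs; dim = 2048 matches the ResNet-50 backbone output, and the fc → BN → ReLU ordering follows the description above (variable names are mine):

```python
import torch.nn as nn

dim, pred_hidden = 2048, 512

# projection MLP (appended to the backbone): 3 fc layers, each with BN;
# the output fc keeps BN but has no ReLU
projection_mlp = nn.Sequential(
    nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(inplace=True),
    nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(inplace=True),
    nn.Linear(dim, dim), nn.BatchNorm1d(dim),
)

# prediction MLP h: 2 fc layers with a 512-d bottleneck;
# the output fc has no BN and no ReLU
prediction_mlp = nn.Sequential(
    nn.Linear(dim, pred_hidden), nn.BatchNorm1d(pred_hidden), nn.ReLU(inplace=True),
    nn.Linear(pred_hidden, dim),
)
```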

Experimental Setup

  • unsupervised pre-training on the 1000-class ImageNet training set, without using labels
  • evaluation by the common linear protocol: train a supervised linear classifier on frozen representations using the training set, then test it on the validation set

Empirical Study: Stop-gradient

Empirical Study: Predictor

Symmetrized loss (with predictor h):

\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \text{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \text{stopgrad}(z_1))

Without predictor:

\mathcal{L} = \frac{1}{2}\mathcal{D}(z_1, \text{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(z_2, \text{stopgrad}(z_1))

  • The gradient has the same direction as the gradient of D(z_1, z_2), with the magnitude scaled by 1/2, so collapse is not prevented (see the derivation below)
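To make this concrete: with the symmetric cosine D and no predictor, each z_i receives gradient only through the term in which it is the first argument, so

```latex
% each term's second argument is behind stopgrad and receives no gradient
\frac{\partial \mathcal{L}}{\partial z_1}
  = \frac{1}{2}\,\frac{\partial \mathcal{D}(z_1, z_2)}{\partial z_1},
\qquad
\frac{\partial \mathcal{L}}{\partial z_2}
  = \frac{1}{2}\,\frac{\partial \mathcal{D}(z_1, z_2)}{\partial z_2}
% the same direction as the gradient of D(z_1, z_2), scaled by 1/2,
% so the stop-gradient changes nothing and collapse is not prevented
```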

Batch Size

  • SimCLR and SwAV both require a large batch (e.g., 4096).
  • standard SGD optimizer does not work well when the batch is too large (even in supervised learning [14, 38]).
  • Our results show that a specialized optimizer is not necessary for preventing collapsing.

Batch Normalization

  • (a) Removing BN from all MLP layers does not cause collapse, although the accuracy is low (34.6%); the low accuracy is likely due to optimization difficulty.

Similarity Function

We modify D\mathcal{D} as:

\mathcal{D}(p_1, z_2) = -\operatorname{softmax}(z_2) \cdot \log \operatorname{softmax}(p_1)
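A sketch of this variant (softmax along the channel dimension, stop-gradient applied to z as before):

```python
import torch.nn.functional as F

def D_ce(p, z):
    # cross-entropy between the channel-wise softmax of z (target, detached)
    # and the channel-wise log-softmax of p (prediction)
    z = z.detach()  # stop-gradient
    return -(F.softmax(z, dim=1) * F.log_softmax(p, dim=1)).sum(dim=1).mean()
```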

The cross-entropy variant can converge to a reasonable result without collapsing.

This suggests that the collapsing prevention behavior is not just about the cosine similarity.

Symmetrization

Symmetrized loss:

\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \text{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \text{stopgrad}(z_1))

Asymmetric loss:

\mathcal{D}(p_1, \text{stopgrad}(z_2))

2× asymmetric loss: sampling two pairs of augmentations for each image (see the sketch below)
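As a sketch, with D as defined earlier and (p1a, z2a), (p1b, z2b) denoting two independently sampled augmentation pairs of the same image (hypothetical names):

```python
loss_sym   = D(p1, z2) / 2 + D(p2, z1) / 2    # symmetrized
loss_asym  = D(p1, z2)                        # asymmetric
loss_asym2 = (D(p1a, z2a) + D(p1b, z2b)) / 2  # 2x asymmetric: two pairs per image
```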

Hypothesis: Formulation

Our hypothesis is that SimSiam is an implementation of an Expectation-Maximization (EM) like algorithm.

We consider a loss function of the following form:

\mathcal{L}(\theta, \eta) = \mathbb{E}_{x, \mathcal{T}}\left[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x)) - \eta_{x}\right\|_{2}^{2}\right]

  • F: a network parameterized by θ
  • T: an augmentation
  • x: an image
  • η_x: the representation of the image x

Hypothesis: Formulation

\mathcal{L}(\theta, \eta) = \mathbb{E}_{x, \mathcal{T}}\left[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x)) - \eta_{x}\right\|_{2}^{2}\right]

With this formulation, we consider solving:

\min_{\theta, \eta} \mathcal{L}(\theta, \eta)

This formulation is analogous to k-means clustering [28].

  • The variable θ is analogous to the clustering centers: it is the learnable parameters of an encoder.
  • The variable η_x is analogous to the assignment vector of the sample x (a one-hot vector in k-means): it is the representation of x.

Hypothesis: Formulation

\min_{\theta, \eta} \mathcal{L}(\theta, \eta)

The problem can be solved by an alternating algorithm, fixing one set of variables and solving for the other set.

\begin{aligned} \theta^{t} &\leftarrow \arg\min_{\theta} \mathcal{L}(\theta, \eta^{t-1}) \quad (7) \\ \eta^{t} &\leftarrow \arg\min_{\eta} \mathcal{L}(\theta^{t}, \eta) \quad (8) \end{aligned}

Solving for θ\theta

\mathcal{L}(\theta, \eta) = \mathbb{E}_{x, \mathcal{T}}\left[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x)) - \eta_{x}\right\|_{2}^{2}\right]

\theta^{t} \leftarrow \arg\min_{\theta} \mathcal{L}(\theta, \eta^{t-1})

One can use SGD to solve the sub-problem (7).

The stop-gradient operation is a natural consequence, because the gradient does not back-propagate to η^{t-1}, which is a constant in this sub-problem.

Solving for η\eta

\mathcal{L}(\theta, \eta) = \mathbb{E}_{x, \mathcal{T}}\left[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x)) - \eta_{x}\right\|_{2}^{2}\right]

\eta^{t} \leftarrow \arg\min_{\eta} \mathcal{L}(\theta^{t}, \eta)

The sub-problem (8) can be solved independently for each η_x: minimizing the expected squared error over the augmentations T gives

\eta_{x}^{t} \leftarrow \mathbb{E}_{\mathcal{T}}\left[\mathcal{F}_{\theta^{t}}(\mathcal{T}(x))\right]
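Putting the two sub-problems together, here is a self-contained toy sketch of the alternating algorithm on random vectors; the dimensions, the additive-noise "augmentation", and the step counts are illustrative assumptions, not the paper's setup:

```python
import torch
import torch.nn as nn

N, d_in, d_out = 256, 32, 16
data = torch.randn(N, d_in)                       # stand-in for the images x
aug = lambda x: x + 0.1 * torch.randn_like(x)     # stand-in for T(x)

F_theta = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))
opt = torch.optim.SGD(F_theta.parameters(), lr=0.05, momentum=0.9)

with torch.no_grad():
    eta = F_theta(aug(data))                      # initialize eta_x

for t in range(10):                               # outer alternation loop
    # (7): solve for theta with eta^{t-1} fixed (a few SGD steps)
    for _ in range(100):
        idx = torch.randint(0, N, (64,))
        loss = ((F_theta(aug(data[idx])) - eta[idx]) ** 2).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # (8): solve for eta with theta^t fixed, averaging over sampled augmentations
    with torch.no_grad():
        eta = torch.stack([F_theta(aug(data)) for _ in range(8)]).mean(dim=0)
```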

One-step Alternation

\eta_{x}^{t} \leftarrow \mathbb{E}_{\mathcal{T}}\left[\mathcal{F}_{\theta^{t}}(\mathcal{T}(x))\right]

SimSiam can be approximated by one-step alternation between (7) and (8).

\eta_{x}^{t} \leftarrow \mathcal{F}_{\theta^{t}}(\mathcal{T}'(x)) \quad (10)

Inserting it into the sub-problem (7), we have:

\theta^{t+1} \leftarrow \arg\min_{\theta} \mathbb{E}_{x, \mathcal{T}}\left[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x)) - \mathcal{F}_{\theta^{t}}(\mathcal{T}'(x))\right\|_{2}^{2}\right]

Now θ^t is a constant in this sub-problem, and T' implies another view due to its random nature ⇒ this has the form of a Siamese network with stop-gradient.

Predictor

By definition, the predictor hh is expected to minimize:

\mathbb{E}_{z}\left[\left\|h(z_1) - z_2\right\|_{2}^{2}\right]

Optimal solution:

h(z_1) = \mathbb{E}_{z}[z_2] = \mathbb{E}_{\mathcal{T}}[f(\mathcal{T}(x))] \quad \text{for any image } x

In practice, it would be unrealistic to actually compute the expectation E_T[·]. But it may be possible for a neural network (e.g., the predictor h) to learn to predict the expectation, while the sampling of T is implicitly distributed across multiple epochs.

Symmetrization

Our hypothesis does not involve symmetrization. Symmetrization is like a denser sampling of T.

Symmetrization is not necessary for our method to work, yet it is able to improve accuracy, as we have observed in Sec. 4.6.

Multi-step Alternation

\begin{aligned} \theta^{t} &\leftarrow \arg\min_{\theta} \mathcal{L}(\theta, \eta^{t-1}) \quad (7) \\ \eta^{t} &\leftarrow \arg\min_{\eta} \mathcal{L}(\theta^{t}, \eta) \quad (8) \end{aligned}

In this variant, we treat tt in (7) and (8) as the index of an outer loop; and the sub-problem in (7) is updated by an inner loop of kk SGD steps.

We use the same architecture and hyper-parameters as SimSiam, and compare the multi-step variants (k SGD steps per η update) against the one-step SimSiam baseline.

This experiment suggests that the alternating optimization is a valid formulation, and SimSiam is a special case of it.

Expectation over Augmentations

We consider another way to approximate this expectation, under which we find that the predictor h is not needed.

In this variant, we do not update ηx\eta_x directly by the assignment (10); instead, we maintain a moving-average:

\eta_{x}^{t} \leftarrow m \cdot \eta_{x}^{t-1} + (1-m) \cdot \mathcal{F}_{\theta^{t}}(\mathcal{T}'(x))
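A pseudocode sketch of this proof-of-concept, in the same placeholder style as the algorithm sketch earlier; eta is a per-sample memory bank indexed by the minibatch indices idx, m is the moving-average coefficient, and the value of m and the exact update order are my assumptions:

```python
m = 0.8                                     # assumed coefficient, for illustration
for x, idx in loader:                       # minibatch with per-sample indices
    x1, x2 = aug(x), aug(x)
    z1 = f(x1)                              # F_theta(T(x)); no predictor h
    with torch.no_grad():
        eta[idx] = m * eta[idx] + (1 - m) * f(x2)   # moving-average target
    loss = D(z1, eta[idx])                  # D = negative cosine similarity
    loss.backward()
    update(f)                               # SGD update of f only
```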

This variant has 55.0% accuracy without the predictor h. As a comparison, it fails completely if we remove h but do not maintain the moving average (as shown in Table 1a).

This proof-of-concept experiment supports that the usage of the predictor h is related to approximating E_T[·].

Discussion (*)

Our hypothesis is about what the optimization problem can be. It does not explain why collapsing is prevented.

The alternating optimization provides a different trajectory, and the trajectory depends on the initialization.

Comparisons

  • Result Comparisons
  • Methodology Comparisons

Result Comparisons: ImageNet

Result Comparisons: Transfer Learning

Methodology Comparisons

Methodology Comparisons: Relation to SimCLR

SimCLR relies on negative samples ("dissimilarity") to prevent collapsing.

SimSiam \approx "SimCLR without negatives"

Methodology Comparisons: Relation to SwAV

SimSiam \approx "SwAV without online clustering"

Methodology Comparisons: Relation to BYOL

SimSiam \approx "BYOL without the momentum encoder"

The momentum encoder may be beneficial for accuracy (Table 4), but it is not necessary for preventing collapsing.

Although not directly related, the momentum encoder also produces a smoother version of η\eta. We believe that other optimizers for solving (8) are also plausible, which can be a future research problem.

Summary

SimSiam: Siamese networks with simple design

  • predictor hh
  • stop-gradient
  • symmetrized loss