This repository is an attempt at replicating some of the results presented in Irina Higgins et al.'s papers:
- "Early Visual Concept Learning with Unsupervised Deep Learning"
- "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework"
In order to use DeepMind's "dSprites - Disentanglement testing Sprites dataset", you need to clone their repository and place it at the root of this one:

```
git clone https://github.com/deepmind/dsprites-dataset.git
```
In order to use the XYS-latent dataset, you need to:
- download it here,
- extract it at the root of this repository's folder.
Using this dataset and the following hyperparameters:
- Number of latent variables: 10
- Learning rate: 1e-5
- "Temperature" hyperparameter Beta: 4e0
- Number of layers of the decoder: 3
- Base depth of the convolution/deconvolution layers: 32
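The Beta hyperparameter above weights the KL term of the training objective. A minimal sketch of the beta-VAE loss, assuming a diagonal-Gaussian posterior over the latents (the function name and list-based representation are illustrative, not taken from this repository's code):

```python
import math

def beta_vae_loss(recon_loss, mu, log_var, beta=4.0):
    """Beta-VAE objective: reconstruction term plus a Beta-weighted
    KL divergence between the diagonal-Gaussian posterior q(z|x)
    and the standard-normal prior p(z).

    mu, log_var: per-latent posterior parameters (plain lists here,
    for illustration; a real implementation would use tensors)."""
    # Analytic KL for a diagonal Gaussian vs. N(0, I):
    # 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    kl = 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                   for m, lv in zip(mu, log_var))
    return recon_loss + beta * kl

# When the posterior matches the prior exactly, the KL term vanishes:
print(beta_vae_loss(1.0, [0.0, 0.0], [0.0, 0.0]))  # → 1.0
```

With Beta = 4e0 as above, the objective trades some reconstruction quality for a posterior that stays closer to the isotropic prior, which is what encourages disentanglement.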
With regard to the reconstruction images, every pair of rows shows the real images on the first row and the reconstructed images on the second. With regard to the latent space sampling, each latent variable is sampled at equally spaced values in the range [-3,3].
Epoch | Reconstruction | Latent Space |
---|---|---|
1 | ||
10 | ||
30 | ||
50 | ||
80 | ||
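The latent-space figures traverse one latent variable at a time over equally spaced values in [-3,3] while the others stay fixed; a small sketch of that sampling (the number of steps is an assumption):

```python
def traversal_values(low=-3.0, high=3.0, steps=7):
    """Equally spaced values used to traverse a single latent variable
    while the remaining latents are held fixed. The [-3, 3] range
    matches the figures above; the step count is illustrative."""
    step = (high - low) / (steps - 1)
    return [low + i * step for i in range(steps)]

print(traversal_values())  # → [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
```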
While the X, Y coordinate and S scale latent variables are clearly disentangled and well reconstructed, the Sh shape latent variable is far from being reconstructed, let alone disentangled.
Using this dataset and the following hyperparameters:
- Number of latent variables: 10
- Learning rate: 1e-5
- "Temperature" hyperparameter Beta: 5e3
- Number of layers of the decoder: 5
- Base depth of the convolution/deconvolution layers: 32
- Stacked architecture: [x]
Considering one column, each group of three rows contains:
- the full image,
- the right-eye patch extracted from the full image,
- the left-eye patch extracted from the full image.
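The eye patches above are simple crops of the full image. A toy sketch of that extraction step, with purely hypothetical bounding boxes (the actual patch coordinates depend on the dataset's annotations and are not given here):

```python
def extract_eye_patches(image, right_box, left_box):
    """Crop right- and left-eye patches from a full image, represented
    here as a nested list of pixel rows. Each box is a hypothetical
    (top, left, height, width) tuple -- real coordinates would come
    from the dataset's landmark annotations."""
    def crop(img, top, left, h, w):
        return [row[left:left + w] for row in img[top:top + h]]
    return crop(image, *right_box), crop(image, *left_box)

# Toy 4x6 "image" with a distinct value per pixel:
img = [[r * 10 + c for c in range(6)] for r in range(4)]
right, left = extract_eye_patches(img, (1, 0, 2, 2), (1, 4, 2, 2))
print(right)  # → [[10, 11], [20, 21]]
print(left)   # → [[14, 15], [24, 25]]
```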
Epoch | Reconstruction | Latent Space |
---|---|---|
1 | ||
10 | ||
30 | ||
70 | ||
100 | ||
The S scale latent variable seems to have been clearly disentangled, while the other two latent variables, the X and Y coordinates of the gaze on the camera plane, seem to require a finer level of detail from the decoder to show good reconstructions. Further analysis shows that those latent variables are also quite nicely disentangled, even though it is difficult to see here.
I do not own any rights to some of the datasets that have been used and experimented with, namely:
- DeepMind's "dSprites - Disentanglement testing Sprites dataset"
- Yann LeCun's "MNIST", through its PyTorch Dataset wrapper