In contrast with computer vision, biological vision is characterized by an anisotropic sensor (the retina) and by the ability to move the gaze to different locations in the visual scene through ocular saccades. To better understand how the human eye analyzes visual scenes, a bio-inspired artificial vision model was recently proposed by Daucé et al. (2020) [1]. The goal of this master's internship is to compare the results obtained by Daucé et al. with those of more classical attentional computer vision models, such as the Spatial Transformer Network [2], in which the visual input undergoes a foveal deformation.
Taking a look at a few examples from the dataset:
- Spatial Transformer: 2 convolutional layers in the localization network (ConvNet), grid sampler without downscaling (28x28 pixels) → (affine transformation) = 6 parameters; a sketch of this configuration follows the training details below.
Training for 160 epochs with SGD and a learning rate of 0.01 without decay; every 10 epochs, the shift standard deviation is incremented by 1 (from 0 to 15).
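This first configuration follows the standard PyTorch spatial-transformer recipe. Below is a minimal sketch, assuming 1x28x28 inputs; the channel counts and hidden size of the localization ConvNet are illustrative assumptions, not the exact values used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    """Spatial transformer with a 2-conv localization network and a full
    6-parameter affine transform, sampled at the input resolution."""
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(                       # localization ConvNet
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(),
            nn.Linear(32, 6),                           # the 6 affine parameters
        )
        # Start from the identity transform so training begins with an untouched image
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):                               # x: (N, 1, 28, 28)
        theta = self.fc_loc(self.loc(x).flatten(1)).view(-1, 2, 3)
        # Grid sampler without downscaling: the output keeps the 28x28 resolution
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

The resampled 28x28 image is then fed to an ordinary classifier (not shown).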
Training statistics:
- Overall results: Central accuracy of 88% and general accuracy of 43%, compared to 84% and 34% in the generic what pathway, respectively.
Accuracy map comparison with the generic What pathway from the paper, using the same training parameters:
| Spatial Transformer Network | Generic What pathway [1] |
| --- | --- |
A test on a noisy dataset with a shift standard deviation of 7:
Taking a look at a few examples:
- Spatial Transformer: 4 convolutional layers in the localization network (ConvNet), grid sampler without downscaling (128x128 pixels) → (affine transformation) = 6 parameters
Training for 110 epochs with an initial learning rate of 0.01 that decays by a factor of 10 every 30 epochs; every 10 epochs the standard deviation of the eccentricity is increased, and the contrast is varied during the last 20 epochs.
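A minimal sketch of this schedule, assuming PyTorch: the digit-placement helper `place_digit` is an illustrative reconstruction of the 128x128 noisy display (offsets drawn from a Gaussian with the current eccentricity standard deviation), `model` stands for the STN plus its classifier, and both the eccentricity step and the 30-70% contrast range are assumptions taken from the test conditions below.

```python
import torch

def place_digit(digit28, ecc_std, contrast, canvas=128):
    """Paste a (28, 28) digit on a (canvas, canvas) background at a random
    offset drawn from N(0, ecc_std), then rescale its contrast."""
    img = torch.zeros(canvas, canvas)
    off = (torch.randn(2) * ecc_std).round().int().clamp(-(canvas - 28) // 2, (canvas - 28) // 2)
    top, left = (canvas - 28) // 2 + off[0], (canvas - 28) // 2 + off[1]
    img[top:top + 28, left:left + 28] = contrast * digit28
    return img

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)      # `model`: STN + classifier (assumed)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # /10 every 30 epochs

for epoch in range(110):
    ecc_std = epoch // 10                                     # widened every 10 epochs (step size assumed)
    c_lo, c_hi = (0.3, 0.7) if epoch >= 90 else (1.0, 1.0)    # contrast varied over the last 20 epochs
    # ... build a batch of place_digit(d, ecc_std, contrast=torch.empty(1).uniform_(c_lo, c_hi))
    # samples, run one training epoch, then step the learning rate:
    scheduler.step()
```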
After transformation with an STN:
Performance when the contrast varies between 30% and 70% and the digit is shifted by 40 pixels (the maximum amount):
- Spatial Transformer: 4 convolutional layers in the localization network (ConvNet), grid sampler with downscaling (28x28 pixels) → (attention) = 3 parameters; see the sketch after the training details below.
Training for 110 epochs with an initial learning rate of 0.01 that decays by half every 10 epochs; every 10 epochs the standard deviation of the eccentricity is increased, and the contrast is varied during the last 20 epochs.
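Below is a minimal sketch of the 3-parameter "attention" head, assuming PyTorch: the localization network predicts an isotropic zoom and a 2-D translation, which are packed into an affine matrix with rotation and shear fixed to zero, and the grid sampler renders the attended window at 28x28 (the downscaling). The function name and the 4-conv trunk it would sit on are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_grid_sample(x, params, out_size=28):
    """x: (N, 1, 128, 128) input; params: (N, 3) = (scale, tx, ty) from the
    localization network. Returns a (N, 1, 28, 28) downscaled, attended crop."""
    s, tx, ty = params[:, 0], params[:, 1], params[:, 2]
    zero = torch.zeros_like(s)
    theta = torch.stack([
        torch.stack([s, zero, tx], dim=1),     # [ s  0  tx ]
        torch.stack([zero, s, ty], dim=1),     # [ 0  s  ty ]
    ], dim=1)                                  # (N, 2, 3): no rotation, no shear
    grid = F.affine_grid(theta, (x.size(0), x.size(1), out_size, out_size), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)
```

Compared with the 6-parameter affine variant, this transform can only zoom and translate, which is closer to an attentional "where" read-out.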
After transformation with an ATN (an STN parameterized for attention), the digit is shifted by 40 pixels to check whether the network can catch it:
Performance when the contrast is 30% and the digit is shifted by 40 pixels (the maximum amount):
- Spatial Transformer: 2 fully-connected layers in the localization network (feed-forward net), grid sampler with downscaling (28x28 pixels) → (fixed attention) = 2 parameters
Polar-logarithmic (POLO) compression: the filters were placed on a [theta=8, eccentricity=6, azimuth=16] grid, i.e. 768 dimensions, giving a compression of ~95%; the original What/Where model used 2880 filters, with a lower compression rate of ~83%. A minimal sketch of such a filter bank follows the training details below.
Training for 110 epochs with an initial learning rate of 0.005 that decays by half every 10 epochs; every 10 epochs the standard deviation of the eccentricity is increased, and the contrast is varied during the last 20 epochs.
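A minimal sketch of such a log-polar filter bank, assuming PyTorch: 8 orientations x 6 eccentricities x 16 azimuths = 768 Gabor-like filters on a 128x128 input, with the retinal code obtained by projecting the flattened image on the bank. The function name `polo_filter_bank` and the geometry constants (`r0`, `growth`, the bandwidths) are illustrative assumptions, not the parameters of the original What/Where model.

```python
import math
import torch

def polo_filter_bank(size=128, n_theta=8, n_ecc=6, n_azimuth=16, r0=4.0, growth=1.6):
    """Return a (n_theta * n_ecc * n_azimuth, size * size) bank of oriented
    filters whose centers lie on a log-polar grid and whose width grows
    with eccentricity."""
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing="ij")
    cx = cy = (size - 1) / 2.0
    filters = []
    for e in range(n_ecc):
        radius = r0 * growth ** e                 # log-spaced eccentricity rings
        sigma = 0.5 * radius                      # filter size grows with eccentricity
        for a in range(n_azimuth):
            phi = 2 * math.pi * a / n_azimuth
            x0, y0 = cx + radius * math.cos(phi), cy + radius * math.sin(phi)
            envelope = torch.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))
            for t in range(n_theta):
                ang = math.pi * t / n_theta
                carrier = torch.cos(math.pi / sigma *
                                    ((xs - x0) * math.cos(ang) + (ys - y0) * math.sin(ang)))
                g = envelope * carrier
                filters.append(g / (g.norm() + 1e-8))
    return torch.stack(filters).flatten(1)        # (768, 16384)

W = polo_filter_bank()                            # fixed (non-learned) retinal encoder
# code = images.flatten(1) @ W.t()                # (N, 768) log-polar code fed to the network
```

With 768 filters against the 16384 raw pixels of a 128x128 input, the code keeps roughly 5% of the original dimensions, which is where the ~95% compression figure comes from (the 2880-filter bank of the original model corresponds to ~83%).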
After transformation with a POLO-ATN, the digit is shifted by 40 pixels to check whether the network can catch it:
Accuracy comparison of the spatial transformer networks with the What/Where model, as a function of the contrast and the eccentricity of the digit on the screen.