This project involves building a convolutional neural network (CNN) to classify images from the popular UTKFace dataset by a person's gender, age, and race. The packages used for the virtual environment are listed in the requirements.txt file; after cloning this repository and entering the directory, you can install them with the following command:
pip install -r requirements.txt
A conda virtual environment was used with a Jupyter Notebook for this. You can see a sample of the predictions generated by the model in the image below:
I've taken some notes to get familiar with the use of CNNs and to gain an introductory understanding of their functionality, which you can also see below!
- say we want to identify swans in images
- there are characteristics that can help us find a swan in an image
- but for some, it may be more difficult to determine if a swan is present in a given image
- the features may be present, but in some images it can be difficult to pick out the characteristic ones
- until now, we've detected features in images in naïve ways — these detectors were either too general or too over-engineered
- we need to learn the features to detect
- we need a system that can do representation learning: a technique that automatically finds the relevant features for a task, replacing manual feature engineering (using unsupervised learning via K-means, PCA, etc., supervised dictionary learning, or neural networks)
- multilayer perceptrons (MLPs), the traditional model, use one perceptron for each input (e.g., one per pixel in an image, multiplied by 3 for RGB) — the number of weights quickly becomes unmanageable, even for small images, which makes the network difficult to train and can also cause overfitting
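- for example, even a small 100 x 100 RGB image has 100 x 100 x 3 = 30,000 inputs, so a single fully connected hidden layer of just 1,000 neurons already requires 30 million weights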
- another issue with MLPs is that they react differently to an input and its shifted version — e.g., if a cat appears in the top left of one image and in the bottom right of another, the MLP will try to correct itself and end up assuming that a cat will always appear in that section of the image
- we analyze the influence of nearby pixels using a filter
- we take a filter of a size specified by the user (3x3 or 5x5, usually by rule of thumb) and move across the image from the top left to the bottom right
- for each position on the image, a value is calculated by applying the filter to the underlying pixels with a convolution operation (see the NumPy sketch below)
- this reduces the number of weights that the NN must learn compared to an MLP, and also means that when the location of the features changes it doesn’t throw the NN off
- after the filters have passed over the image, a feature map is generated for each filter, which are then taken through an activation function to decide whether a certain feature is present at a given location in the image
- we can add more filtering layers and create more feature maps, and use pooling layers to select the largest values on the feature maps and use those as inputs into subsequent layers
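Here is a minimal NumPy sketch of the sliding-filter idea described above (strictly speaking a "valid" cross-correlation, which is what deep-learning libraries actually compute when they say convolution). The `conv2d` helper and the kernel values are illustrative, not taken from the project notebook:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (top left to bottom right) and
    compute one output value per position ('valid' convolution)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the patch by the filter and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
edge_kernel = np.array([[-1.0, 0.0, 1.0],          # crude vertical-edge filter
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])
print(conv2d(image, edge_kernel))                  # 3x3 feature map
```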
- CNNs are also made of layers, which might not be fully connected — they have filters, sets of cube-shaped weights that are applied throughout the image
- each 2D slice of a filter is a kernel; kernels introduce translation invariance and parameter sharing
- 3 types of layers in a CNN
- convolutional layer — layers where filters are applied to the original image, or to other feature maps in a deep CNN; this is where most of the user-specified parameters in the network live (the most important are the number of kernels and their size; a short Keras sketch tying the three layer types together appears after this list)
- action — apply filters to extract features, filters are composed of small kernels, one bias per filter, apply activation function on every value of the feature map
- parameters — number of kernels, size of kernels (width and height only, depth is defined by input cube), activation function, stride, padding, regularization type and value
- I/O — input: 3D cube, previous set of feature maps, output: 3D cube, one 2D map per filter
- pooling layer — performs a fixed function, such as max pooling, which takes the max value in a filter region, or average pooling, which takes the average value in a filter region; used to reduce the dimensionality of the network
- action — aggregate each filter region of a feature map into a single value (max or average); no weights are learned here
- parameters — size of the pooling window and the stride
- I/O — input: 3D cube, previous set of feature maps, output: 3D cube with reduced width and height
- fully connected layer — placed before the classification output of a CNN and used to flatten the results before classification, similar to the output layer of an MLP
- action — aggregate info from the final feature maps, generate the final classification
- parameters — number of nodes, activation function: changes depending on the role of the layer (if aggregating info, use ReLU; if producing the final classification, use softmax)
- I/O — input: flattened 3D cube (the final set of feature maps), output: 1D vector
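Below is a minimal Keras sketch tying the three layer types together (assuming the TensorFlow/Keras stack); the input shape, filter counts, and the 5-class output are illustrative placeholders, not the settings used in the notebook:

```python
from tensorflow.keras import layers, models

# Illustrative only: a tiny CNN with the three layer types described above.
model = models.Sequential([
    # convolutional layer: K = 16 filters of size F = 3, ReLU activation
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    # pooling layer: F = 2, S = 2 max pooling halves the width and height
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    # flatten the final 3D cube of feature maps into a 1D vector
    layers.Flatten(),
    # fully connected layers: ReLU to aggregate, softmax to classify
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),  # 5 = hypothetical class count
])
model.summary()
```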
- what do CNN layers learn?
- each layer learns filters of increasing complexity
- first layers learn basic feature detection (edges, corners, etc.)
- middle layers learn filters that detect parts of objects (for faces, they might learn to respond to eyes, noses, etc.)
- last layers have higher representations: they learn to recognize full objects, in different shapes and positions
- CNNs are designed to process input images
- architecture is composed of 2 main blocks
- the first block is what makes this type of NN distinctive, since it functions as a feature extractor
- performs template matching by applying convolution filtering operations
- first layer filters the image with several convolution kernels and returns feature maps, which are normalized with an activation function and/or resized
- this can be repeated several times — we filter feature maps obtained with new kernels, which gives us new feature maps to normalize and resize, which we can filter again, and so on and so forth
- the values of the last feature maps are concatenated into a vector, defining the output of the first block and input of the second one
- the second block sits at the end of all neural networks used for classification
- the input vector values are transformed (with several linear combinations and activation functions) to produce a new vector at the output
- the last vector contains as many elements as there are classes
- element i represents the probability that the image belongs to class i — each element is therefore between 0 and 1, and the sum of all is 1
- these probabilities are then calculated by the last layer of the block, which uses a logistic function (binary classification) or a softmax function (multi-class classification) as an activation function
- the parameters of the layers are determined by gradient back-propagation — the cross-entropy is minimized during the training phase
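As a quick illustration of the last few points, here is a small NumPy sketch (with made-up logits) showing that softmax outputs lie between 0 and 1 and sum to 1, and the cross-entropy value that training would minimize:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # made-up scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())           # ~[0.659 0.242 0.099], sums to 1.0

true_class = 0
cross_entropy = -np.log(probs[true_class])  # minimized during training
print(cross_entropy)
```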
- ReLU correction layer
- ReLU — rectified linear units — refers to the real non-linear function defined by
ReLU(x) = max(0, x)
- replaces all negative values received as inputs by zeros
- acts as an activation function
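In NumPy this is a one-liner; a quick sketch with made-up inputs:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0, x))  # ReLU zeroes the negatives: [0. 0. 0. 1.5 3. ]
```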
- parameterization of the layers
- layers in a CNN are stacked AND parameterized
- the convolution and pooling layers have hyperparameters (parameters whose values you must define in advance)
- size of the output feature maps of the convolution and pooling layers depends on the hyperparameters
- each image is W x H x D (width, height, and depth — 1 for a black-and-white image, 3 for RGB)
- the convolutional layer has 4 hyperparameters
- number of filters K
- the size F of the filters (each filter has dimensions F x F x D pixels)
- the stride S with which you slide the window corresponding to the filter across the image (e.g., a stride of 1 means moving the window one pixel at a time)
- the zero-padding P: add a black contour P pixels thick around the input image of the layer — without this contour, the output dimensions are smaller
- the more convolutional layers are stacked with P = 0, the smaller the feature maps passing through the network become
- we lose info quickly, which makes the task of extracting features difficult
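These hyperparameters determine the spatial size of the output feature maps through the standard formula (W - F + 2P) / S + 1 (and the same for the height); a quick sketch:

```python
def conv_output_size(w, f, s, p):
    """Output width of a convolutional layer: (W - F + 2P) / S + 1,
    applied identically to the height."""
    return (w - f + 2 * p) // s + 1

# 32x32 input, 3x3 filters, stride 1, no padding -> 30x30 feature maps
print(conv_output_size(32, 3, 1, 0))  # 30
# the same filter with P = 1 preserves the input size
print(conv_output_size(32, 3, 1, 1))  # 32
```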
- pooling layer has 2 hyperparameters
- size F of the cells — the image is divided into square cells of size F x F pixels
- the S step — the cells are separated from each other by S pixels
- typically the filters are small and slid across the image one pixel at a time (S = 1), and the zero-padding value is chosen so that the width and height of the input volume are unchanged at the output (for S = 1, this means P = (F - 1) / 2)
- for the pooling layer — F = 2 and S = 2 is a wise choice
- this keeps one value out of every four, eliminating 75% of the input values
- we can also choose F = 3 and S = 2 — which causes overlap
- choosing larger cells causes too much info loss and harms results
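A minimal NumPy sketch of that F = 2, S = 2 max pooling (with made-up values), showing a 4x4 input shrinking to 2x2:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (F = 2, S = 2): keeps the largest
    value in each cell, discarding 75% of the inputs."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 3]])
print(max_pool_2x2(x))  # [[4 2]
                        #  [2 7]]
```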