Convolutional Neural Networks are probably the reason why the fields of machine learning, deep learning and AI are so popular today. They are awesome, and what they do certainly seemed like black magic a couple of years ago. At their core lies the convolution process. This process detects features of an image and uses that information for classification. To be more precise, here is how the complete process of a Convolutional Neural Network looks:
The convolution process detects features of the image using filters and stores them in feature maps. These are then further compressed by pooling layers. The information is then flattened into a 1D array. Finally, a feed-forward network is used for classification. This is all very cool, but there is one big problem with it. This type of network might have trouble classifying the same object in images where that object is seen from different angles. Or, for example, consider the image below:
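The pipeline above can be sketched in plain NumPy. This is only a toy illustration, not an efficient or trained network: the image, filter values and layer sizes are all made up for the example.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in most DL libraries)."""
    h, w = kernel.shape
    out_h = image.shape[0] - h + 1
    out_w = image.shape[1] - w + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling, compressing each feature map."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((28, 28))                  # a fake grayscale "digit"
filters = rng.standard_normal((4, 3, 3))      # 4 random (untrained) 3x3 filters

# 1) convolution -> feature maps, 2) ReLU, 3) pooling, 4) flatten, 5) dense layer
feature_maps = np.stack([np.maximum(conv2d(image, f), 0) for f in filters])
pooled = np.stack([max_pool(fm) for fm in feature_maps])   # (4, 13, 13)
flat = pooled.reshape(-1)                                  # 1D array of 676 values
W = rng.standard_normal((10, flat.size))                   # dense layer -> 10 classes
logits = W @ flat
print(pooled.shape, flat.shape, logits.shape)
```

Note how the max in `max_pool` throws away the position of the strongest activation inside each window — that is exactly the invariance (and information loss) discussed next.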
For a CNN, the mere presence of eyes, a nose and a mouth can be a very strong indicator that these images belong to the same category. Orientational and spatial relationships between these objects are not taken into consideration. This problem occurs because of the pooling layers, which introduce invariance into convolutional neural networks. Many engineers avoid using pooling layers in their networks because they lose information and don't encode relative spatial information between features. That is the problem that Capsule Networks solve. In essence, they embed this information within the network using pose. Pose is a technique borrowed from 3D graphics, where relationships between objects are represented as a combination of translation and rotation.
So what is a capsule and how does it work? The idea is simple: keep the existing information about the probability of a feature, but add relative spatial information about it as well. Capsules replace the traditional neuron, and they model the probability and orientation of a feature with a vector. The length of the vector encodes the probability of the feature, and the direction of the vector encodes its orientation. So if the 'eye' feature in the previous example is moved around the picture, its probability stays the same, but its orientation changes. This means that the length of the capsule's output vector will stay the same, but its direction will change.
We mentioned that the capsule replaces the traditional neuron. A neuron, as the main building block of neural networks, receives scalar values as input and produces a scalar value as output. If we break it into steps, we can describe the process of a traditional neuron like this. First, each input scalar is multiplied by a weight, which is also a scalar. Then all these inputs to one neuron are summed together, resulting again in a scalar. Finally, an activation function is applied to this value, introducing non-linearity and producing a scalar output. So, there is not a lot of room to encode information about orientation.
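In code, those three scalar steps are just a few lines (a sketch; the input values, weights and choice of tanh are arbitrary for the example):

```python
import numpy as np

def neuron(inputs, weights, activation=np.tanh):
    # 1) multiply each scalar input by a scalar weight
    # 2) sum the products into a single scalar
    # 3) apply a non-linear activation -> scalar output
    return activation(np.dot(inputs, weights))

out = neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.2]))
print(out)  # a single number -- nowhere to store orientation
```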
Capsules take a different approach and extend these operations to vectors. Since the output of each capsule is a vector, the inputs to other capsules are vectors as well. This means that in the first step we need to perform matrix multiplication of the input vectors with the weights, which in this case are matrices as well. In essence, these weight matrices encode relative spatial information. They are learned during backpropagation as usual; what dynamic routing determines is how the outputs are passed between capsules.
This routing defines where each capsule's output goes. These results are then multiplied by a scalar coupling coefficient, which dynamic routing uses to define the importance of each capsule's contribution. We might view this value as a weight from traditional neurons, but we should be careful with that notion. Now, all inputs can be summed. Finally, we need to introduce non-linearity into the input-vector-to-output-vector transformation. This is done using one more new concept – squashing. This is a novel activation function that takes a vector as input and then "squashes" it to a length of no more than 1 without changing its direction.
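Putting these steps together, one capsule's forward pass might look like the sketch below. All dimensions are made up for the example, and the coupling coefficients `c` are shown as fixed numbers — in a real network dynamic routing computes them iteratively.

```python
import numpy as np

def squash(v, eps=1e-8):
    """Shrink a vector to length below 1 without changing its direction."""
    norm_sq = np.sum(v ** 2)
    return (norm_sq / (1.0 + norm_sq)) * v / np.sqrt(norm_sq + eps)

rng = np.random.default_rng(0)
u = rng.standard_normal((3, 8))        # outputs of 3 lower-level capsules (8-D each)
W = rng.standard_normal((3, 8, 16))    # one 8x16 weight matrix per input capsule

# 1) matrix-multiply each input vector by its weight matrix -> "prediction" vectors
u_hat = np.einsum('ij,ijk->ik', u, W)  # shape (3, 16)

# 2) weight by scalar coupling coefficients (fixed here; dynamic routing sets these)
c = np.array([0.2, 0.5, 0.3])
s = np.sum(c[:, None] * u_hat, axis=0)  # summed input, shape (16,)

# 3) squash: the vector non-linearity that bounds length while keeping direction
v = squash(s)
print(np.linalg.norm(v))
```

The length of `v` is always below 1, so it can be read directly as the probability that the feature is present, while its direction carries the pose.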
For the purpose of this article, we consider a Capsule Network used on the MNIST dataset. In general, every Capsule Network is divided into two components: an encoder and a decoder. Each of these components has 3 layers. The encoder is constructed from:
1. Convolutional Layer – Uses 256 kernels with a size of 9x9x1 and stride 1, followed by ReLU activation.
2. PrimaryCaps Layer – This layer has 32 primary capsules. These capsules generate combinations of the features detected by the convolutional layer. The structure of these capsules is similar to that of convolutional layers: the 20x20x256 input is processed by eight 9x9x256 convolutional kernels with stride 2. As the output of this process, each capsule produces a 6x6x8 output tensor.
3. DigitCaps Layer – This layer contains one capsule for each digit in the dataset. The input of each capsule is composed of 1152 eight-dimensional vectors, i.e. a 6x6x8x32 input tensor. Each of these input vectors is multiplied by an 8×16 weight matrix. The output of this layer is 10 sixteen-dimensional vectors.
Capsule Network Encoder – Source
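The dimensions above can be verified with a quick NumPy shape check. This sketch uses random tensors and, for simplicity, uniform coupling coefficients in place of dynamic routing.

```python
import numpy as np

rng = np.random.default_rng(0)

# PrimaryCaps output: a 6x6 grid, 32 capsule types, 8-D vectors -> 6x6x32x8 tensor
primary = rng.standard_normal((6, 6, 32, 8))
u = primary.reshape(-1, 8)                  # 1152 input vectors, 8-D each

# one 8x16 weight matrix per (input capsule, digit capsule) pair
W = rng.standard_normal((1152, 10, 8, 16))
u_hat = np.einsum('ni,nkij->nkj', u, W)     # (1152, 10, 16) prediction vectors

# with uniform coupling, each digit capsule averages its 1152 predictions
s = u_hat.mean(axis=0)                      # 10 sixteen-dimensional vectors
print(u.shape, u_hat.shape, s.shape)
```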
The loss function (a margin loss) can be represented with the formula:

Lk = Tk · max(0, m⁺ − ‖vk‖)² + λ · (1 − Tk) · max(0, ‖vk‖ − m⁻)²

where Tk has value 1 if the correct label corresponds to the digit of the k-th DigitCap and 0 otherwise, ‖vk‖ is the length of that capsule's output vector, and m⁺ = 0.9, m⁻ = 0.1 and λ = 0.5 are the values used in the original paper.
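This margin loss is short enough to write directly in NumPy (a sketch; m⁺ = 0.9, m⁻ = 0.1 and λ = 0.5 are the constants from the original Capsule Network paper, and the vector lengths below are invented for the example):

```python
import numpy as np

def margin_loss(v_lengths, T, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_lengths: lengths of the 10 DigitCap output vectors; T: one-hot labels."""
    # penalize the correct capsule if its vector is shorter than m_plus
    present = T * np.maximum(0.0, m_plus - v_lengths) ** 2
    # penalize wrong capsules if their vectors are longer than m_minus
    absent = lam * (1 - T) * np.maximum(0.0, v_lengths - m_minus) ** 2
    return np.sum(present + absent)

v_lengths = np.array([0.05, 0.95, 0.2, 0.1, 0.0, 0.3, 0.1, 0.05, 0.0, 0.1])
T = np.zeros(10)
T[1] = 1.0                      # the correct digit is "1"
print(margin_loss(v_lengths, T))
```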
The decoder is composed of three fully connected layers used to decode the output of the DigitCaps layer back into an image. During this process, it uses the Euclidean distance between the reconstructed image and the input image as its loss function.
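The reconstruction loss can be sketched as the sum of squared pixel differences, scaled down so it does not dominate the margin loss (the original paper scales it by 0.0005; the images below are random stand-ins):

```python
import numpy as np

def reconstruction_loss(original, reconstructed, scale=0.0005):
    # squared Euclidean distance between the flattened images, scaled down
    diff = original.reshape(-1) - reconstructed.reshape(-1)
    return scale * np.sum(diff ** 2)

rng = np.random.default_rng(0)
img = rng.random((28, 28))
noisy = img + 0.01 * rng.standard_normal((28, 28))  # stand-in "reconstruction"
print(reconstruction_loss(img, noisy))
```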
Capsule Network Decoder – Source
In this article, we covered the basics of Capsule Networks and their mechanisms. If you want to learn more about them, check out this collection of links.
Author: Nikola Zivkovic, Rubik’s Code