The U-Net architecture, initially developed for biomedical image segmentation, has been adapted successfully to a wide range of other tasks. It also served as the backbone of the PCNet architecture in the Self-Supervised Scene De-Occlusion paper, which I discussed in my earlier blog post. As a sequel to that post, I share a few of my observations about the U-Net architecture.
The main advances of the U-Net architecture are better localization in the final output and its speed. The output of the network is a labeled segmentation mask, as shown in image (c) below, taken from the paper. Labeled segmentation essentially means assigning a class label to each pixel.
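To make "a class label for each pixel" concrete, here is a minimal sketch in plain Python. It assumes the network has already produced one score map per class; the function name `scores_to_mask` and the toy scores are my own illustration, not something taken from the paper.

```python
def scores_to_mask(scores):
    """Turn per-class score maps into a labeled mask.

    scores[c][y][x] is the score of class c at pixel (y, x).
    Returns mask[y][x] = index of the highest-scoring class.
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 2-class scores (background = 0, cell = 1) on a 2x3 image.
scores = [
    [[0.9, 0.2, 0.1], [0.8, 0.7, 0.3]],  # class 0
    [[0.1, 0.8, 0.9], [0.2, 0.3, 0.7]],  # class 1
]
mask = scores_to_mask(scores)
print(mask)  # [[0, 1, 1], [0, 0, 1]]
```

The mask has the same height and width as the score maps, which is what separates segmentation from whole-image classification.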
Below is an image from the U-Net paper showing the convolutional layers of the left half, the de-convolutional layers of the right half (the paper calls these up-convolutions), and the skip connections between corresponding levels of the ‘U’. These are sometimes loosely called residual connections, but U-Net concatenates the cropped feature maps from the left half onto those of the right half rather than adding them. The distinctive ‘U’ shape arises because U-Net dedicates more layers to the de-convolutional steps than many other architectures do (in many networks, the vast majority of layers are convolutional). It is these additional de-convolutional layers which, along with the connections across the ‘U’, lead to better localization in the segmentation mask output. The specific layers are fairly typical of a traditional CNN, so I will not address them here, but they can be found in the Network Architecture section of the paper.
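One way to get a feel for the ‘U’ is to trace how the feature-map size changes on the way down and back up. The sketch below assumes the layout given in the paper's Network Architecture section: two unpadded 3x3 convolutions per level, 2x2 max pooling going down, and 2x2 up-convolutions going up. The function name `unet_sizes` is mine, and channel counts are ignored.

```python
def unet_sizes(input_size, depth=4):
    """Trace the spatial size of feature maps through a U-Net-style network.

    Assumes two unpadded 3x3 convolutions per level (each trims 2 pixels),
    2x2 max pooling on the way down, and 2x2 up-convolutions (which double
    the size) on the way up.
    """
    size = input_size
    skip_sizes = []  # left-half sizes saved for the connections across the U

    # Contracting path (left half of the U).
    for _ in range(depth):
        size -= 4                   # two unpadded 3x3 convolutions
        skip_sizes.append(size)
        if size % 2:
            raise ValueError("feature map size must be even before pooling")
        size //= 2                  # 2x2 max pooling

    size -= 4                       # two convolutions at the base of the U

    # Expanding path (right half of the U).
    crops = []
    for skip in reversed(skip_sizes):
        size *= 2                   # 2x2 up-convolution
        crops.append((skip, size))  # left-half map is center-cropped to `size`
        size -= 4                   # two unpadded 3x3 convolutions

    return size, crops


output_size, crops = unet_sizes(572)
print(output_size)  # 388
print(crops)        # [(64, 56), (136, 104), (280, 200), (568, 392)]
```

This matches the figure: a 572x572 input tile yields a 388x388 mask. Each pair in `crops` shows how far a left-half feature map has to be cropped before it lines up with its counterpart on the right.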
The U-Net paper primarily shows that the de-convolutional task is non-trivial and requires a significant number of layers to produce useful masking of the original image.
Additionally, the skip connections (the links across the ‘U’) give the de-convolutional layers direct access to information that is likely useful for the de-convolution task, both increasing the accuracy of the resulting de-convolution and freeing up the main bottleneck (at the base of the U) to encode more meaningful information.
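Mechanically, each connection across the ‘U’ amounts to a center crop followed by a channel-wise concatenation (the "copy and crop" arrows in the figure). Here is a minimal sketch using nested lists in place of real tensors; `center_crop` and `skip_connect` are hypothetical helper names, and square feature maps are assumed.

```python
def center_crop(channel, target):
    """Center-crop one square 2D channel (list of rows) to target x target."""
    offset = (len(channel) - target) // 2
    return [row[offset:offset + target] for row in channel[offset:offset + target]]


def skip_connect(encoder_maps, decoder_maps):
    """Concatenate cropped encoder channels with decoder channels.

    Both arguments are lists of channels; each channel is a square 2D list.
    The encoder channels are cropped to the decoder's spatial size, then the
    two lists are joined along the channel axis.
    """
    target = len(decoder_maps[0])
    cropped = [center_crop(channel, target) for channel in encoder_maps]
    return cropped + decoder_maps


# Toy data: one 4x4 encoder channel and one 2x2 decoder channel.
encoder = [[[r * 4 + c for c in range(4)] for r in range(4)]]
decoder = [[[100, 101], [102, 103]]]

merged = skip_connect(encoder, decoder)
print(len(merged))  # 2
print(merged[0])    # [[5, 6], [9, 10]]
```

Because the encoder channels are carried across unchanged, the layers that follow can read fine spatial detail directly instead of having to recover it from the bottleneck.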
Furthermore, the paper shows that this architecture performed well with very little training data available (they relied heavily on typical augmentation techniques, specifically elastic deformations), which is clearly a useful feature for practical business applications.
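For a rough idea of what an elastic deformation does, here is a simplified sketch in plain Python. As I read the paper, the authors draw random displacement vectors on a coarse 3x3 grid from a Gaussian with a standard deviation of 10 pixels and interpolate them bicubically. My version swaps in bilinear interpolation and nearest-neighbor sampling to stay short, so treat it as an illustration rather than their implementation. The names `bilinear` and `elastic_deform` are my own.

```python
import random


def bilinear(grid, gy, gx):
    """Bilinearly interpolate a 2D grid at fractional position (gy, gx)."""
    y0, x0 = int(gy), int(gx)
    y1 = min(y0 + 1, len(grid) - 1)
    x1 = min(x0 + 1, len(grid[0]) - 1)
    fy, fx = gy - y0, gx - x0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def elastic_deform(image, mask, grid_size=3, sigma=10.0, seed=None):
    """Warp an image and its mask with the same smooth random displacement."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])

    # Random displacement vectors on a coarse grid.
    dy = [[rng.gauss(0, sigma) for _ in range(grid_size)] for _ in range(grid_size)]
    dx = [[rng.gauss(0, sigma) for _ in range(grid_size)] for _ in range(grid_size)]

    def clamp(value, upper):
        return max(0, min(upper, value))

    warped_image = [[0] * width for _ in range(height)]
    warped_mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Position of this pixel on the coarse grid.
            gy = y * (grid_size - 1) / max(height - 1, 1)
            gx = x * (grid_size - 1) / max(width - 1, 1)
            # Source pixel: nearest neighbor, clamped to the image border.
            src_y = clamp(round(y + bilinear(dy, gy, gx)), height - 1)
            src_x = clamp(round(x + bilinear(dx, gy, gx)), width - 1)
            warped_image[y][x] = image[src_y][src_x]
            warped_mask[y][x] = mask[src_y][src_x]
    return warped_image, warped_mask


# Toy example: a 16x16 image whose mask marks pixels brighter than 7.
image = [[(x + y) % 16 for x in range(16)] for y in range(16)]
mask = [[1 if value > 7 else 0 for value in row] for row in image]
new_image, new_mask = elastic_deform(image, mask, sigma=2.0, seed=0)
```

The important detail is that the image and its mask are warped with the same displacement field, so every deformed pixel keeps its correct label and each annotated example can be reused many times.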
Wrapping up Thoughts
Overall, the U-Net architecture was a clear choice for the Self-Supervised Scene De-Occlusion paper. It is an excellent architecture for cases where output segmentation masks are more helpful than more traditional computer vision outputs such as bounding boxes.