We assign a category label to every pixel in an image. Therefore, loss is incurred at every pixel
Input image : CxHxW, where C is the number of channels
Output Image : KxHxW, where K is the number of categories
Upsampling can be done by transpose convolutions. The use of word transpose becomes clear when we look at convolution operation as circulant matrix multiplication operation
max-unpooling operation
Object detection + Semantic segmentation
Need mask proposals from image.
Mask RCNN is built on top of Faster RCNN, where bounding box coordinates go through bilinear pooling to extract the region and pixel wise classification is done