
Manipulating Objects in an Image Through Self-Supervised Scene De-Occlusion


A well-recognized CVPR 2020 paper (https://xiaohangzhan.github.io/projects/deocclusion/) introduces a complete framework for reproducing and recreating objects in a scene. It is a fascinating read, so in this article we provide commentary on its key aspects.

To keep this article short, here we mainly present the structure of the framework, and specifically, we do not cover how the involved convolutional networks (Partial Completion Networks, or PCNets) work. We will cover that in a subsequent article.

Scene de-occlusion decomposes an image, extracting cluttered objects in it into entities of individual intact objects.
Orders and positions of the extracted objects can be manipulated to recompose new scenes.

The framework presented in this paper trains two distinct convolutional networks, the PCNets: PCNet-M (PCNet-Mask) and PCNet-C (PCNet-Content). Both have slightly altered UNet architectures (https://arxiv.org/abs/1505.04597), whose output is at the pixel level. PCNet-C uses partial convolutions to complete the un-occluded object. The images below are taken from the excellent video on the project’s page and are also available in the paper. The input to the framework is an image with appropriate bounding boxes around objects. From this input, we determine, in this order (numbering matches the image at the end of the post): the ordering of objects (1), the complete un-occluded shape of each object (4), and finally the complete un-occluded color and pattern of each object (6).

With the un-occluded object, we can perform a variety of tasks, such as rearranging the objects in a photo while maintaining the correct object ordering, and performing image inpainting on the background. The image inpainting step is not addressed here, as the paper’s novelty lies in its self-supervised object de-occlusion method.

Both PCNet networks are trained by creating random occluding shapes and overlaying them onto the image. Because we generate the occluding shapes ourselves, we know exactly what they hide, which lets us train the networks through self-supervision. The training of the PCNet-M (PCNet-Mask) network consists of placing a random shape either in front of or behind the target object (as determined from the original bounding box). In both cases, the model is trained to predict the original target mask. The second case is meant for regularization and is necessary to prevent the model from always assuming an object is being occluded.
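The two training cases for PCNet-M can be sketched in a few lines. This is a minimal illustration and not the authors' code: masks are represented as plain Python sets of (row, col) pixels rather than image tensors, and the helper names `rectangle` and `make_mask_training_sample`, as well as the 50/50 split between the two cases, are assumptions made for this example.

```python
import random


def rectangle(top, left, height, width):
    """Return a mask as the set of (row, col) pixels inside a rectangle."""
    return {(r, c)
            for r in range(top, top + height)
            for c in range(left, left + width)}


def make_mask_training_sample(target, shape, rng, p_front=0.5):
    """Build one self-supervised sample from a target mask and a random shape.

    Returns (visible_target, eraser, label). The label is always the
    original target mask, in both cases.
    p_front is a hypothetical parameter: the chance of the 'in front' case.
    """
    if rng.random() < p_front:
        # Case 1: the shape sits in front, so it erases part of the target.
        visible_target = target - shape
        eraser = set(shape)
    else:
        # Case 2 (regularization): the shape sits behind the target,
        # so the target stays intact and the shape is the one cut off.
        visible_target = set(target)
        eraser = shape - target
    return visible_target, eraser, set(target)


rng = random.Random(0)
target = rectangle(0, 0, 4, 4)
shape = rectangle(2, 2, 4, 4)
visible, eraser, label = make_mask_training_sample(target, shape, rng)
```

In the second case the network sees an untouched target next to a neighboring shape and must answer "nothing to complete", which is what keeps it from growing every mask it is shown.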

PCNet-C is trained to complete the portion of the target object that is occluded by the random occluding shape. Note that, as in the image below, the target object is occluded not only by the random shape but also by a car (the black car in the bottom right of the image). Any attempt to account for this occlusion in the training process would require knowledge of object ordering and thus a supervised framework. Although we disregard this car's occlusion, other non-occluded cars in the dataset allow the model to learn the true shape of a car. Importantly, the authors found that this training procedure generalizes to cases where there are multiple occluders, perhaps not overlapping with the pastry in the final image below.
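A PCNet-C training sample can be sketched the same way. Again this is only an illustration under stated assumptions: the image is a dictionary from (row, col) to an RGB tuple, masks are sets of pixels that lie inside the image, and the function name `make_content_training_sample` and the grey fill value are hypothetical choices for this example.

```python
def make_content_training_sample(image, target, shape, fill=(128, 128, 128)):
    """Build one self-supervised sample for content completion.

    image  : dict mapping (row, col) -> (R, G, B)
    target : set of pixels in the target's visible (modal) mask
    shape  : set of pixels covered by the random occluding shape
    Both masks are assumed to lie inside the image.
    """
    erased = target & shape    # part of the target hidden by the random shape
    visible = target - shape   # part of the target that remains visible
    # Blank out only the erased pixels; every other pixel is left untouched.
    masked_image = {px: (fill if px in erased else rgb)
                    for px, rgb in image.items()}
    # The original colors of the erased pixels are what the network must restore.
    ground_truth = {px: image[px] for px in erased}
    return masked_image, visible, erased, ground_truth
```

Only the pixels hidden by the random shape are erased. Anything hidden by a real occluder, such as the black car, is left alone because the ground truth for it is unknown.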

The framework procedure is described below (numbers match picture numbering):

  1. Recover the ordering of objects in the image using the PCNet-M
    We test for the ordering between two objects by selecting each in turn as the ‘target’ and running the PCNet-M to find its amodal/un-occluded mask. If an object’s modal/occluded mask matches its amodal/un-occluded mask, it is not occluded by the other object.
  2. Retrieve all of the objects occluding (blocking) a given target object
  3. Generate two images for input into PCNet-M:
    1. A black and white image centered on the target object with the target, background, and union of occluding objects distinguished as in the picture below.
    2. An RGB image centered on the target object with the union of occluding objects greyed out.
  4. Use the PCNet-M to predict the amodal/un-occluded mask of the target object
  5. Generate two images for input into PCNet-C:
    1. A black and white image centered on the target object distinguishing between the target occluded object and the rest of the image.
    2. An RGB image centered on the target object with the difference of the amodal mask vs the modal mask greyed out.
  6. Use the trained PCNet-C to predict the amodal/un-occluded object.
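
Steps 1 and 2 of the procedure above can be sketched as a pairwise test followed by a union of masks. This is a rough illustration, not the paper's implementation: `predict_amodal` is a hypothetical stand-in for a trained PCNet-M, `make_toy_predictor` fakes it by looking up known full shapes, masks are sets of (row, col) pixels, and the exact-equality test follows the description in step 1 rather than any thresholding a real system would likely need.

```python
from itertools import combinations


def recover_order(modal_masks, predict_amodal):
    """Step 1: return {name: set of names that occlude it}."""
    occluders = {name: set() for name in modal_masks}
    for a, b in combinations(modal_masks, 2):
        # Try each object of the pair as the target in turn.
        for target, other in ((a, b), (b, a)):
            modal = modal_masks[target]
            amodal = predict_amodal(modal, modal_masks[other])
            # If completion changes the mask, the other object hides part of it.
            if amodal != modal:
                occluders[target].add(other)
    return occluders


def occluder_union(target, occluders, modal_masks):
    """Step 2: union of the modal masks of everything occluding the target."""
    union = set()
    for name in occluders[target]:
        union |= modal_masks[name]
    return union


def make_toy_predictor(full_shapes):
    """Toy stand-in for PCNet-M that already knows each object's full shape."""
    def predict(target_modal, other_modal):
        for full in full_shapes:
            if target_modal <= full:
                # Complete the target only underneath the other object.
                return target_modal | (full & other_modal)
        return set(target_modal)
    return predict


# A plate partly covered by a pastry.
plate_full = {(r, c) for r in range(4) for c in range(4)}
pastry_full = {(r, c) for r in range(2, 5) for c in range(2, 5)}
modal = {"plate": plate_full - pastry_full, "pastry": set(pastry_full)}

order = recover_order(modal, make_toy_predictor([plate_full, pastry_full]))
blocked_by = occluder_union("plate", order, modal)
```

Here `order` records that the pastry occludes the plate and that nothing occludes the pastry, and `blocked_by` is the region that would be marked as the union of occluders in the inputs of step 3.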
