As an internal R&D project, we tried to segment free floor space, i.e. floor area in which an agent can roam freely and safely. Such a model has many applications, from making autonomous vehicles and AGVs safer, to analyzing public space utilization, to navigating disaster environments.
As an initial goal, we set ourselves the task of segmenting the hallways in the building where our office is located (Otemachi biru), and in particular of excluding safety areas around human beings from the free floor space.
We built this project on top of SegNet together with Mask R-CNN, trained on the combined Hedau and SUNRGB-D 2015 datasets. We furthermore fine-tuned our models on in-house annotated images of the Otemachi building floors.
Below are a few valuable insights that we gained:
Fine-tuning quickly breaks pre-training distribution modelling capacity
Fine-tuning is definitely necessary to obtain good results, as the following images show:
(Person segmentation is not included before fine-tuning)
In our fine-tuning training and validation sets, we only included hallway pictures of our building. The test set, however, also contained a few pictures of the inside of our office. Our models performed much better on these office pictures before fine-tuning.
This is most likely because the office images come from a distribution much more closely related to the pre-training data distribution than to the fine-tuning Otemachi hallway distribution.
This shows yet again how important it is to choose the correct dataset for a specified goal.
Two types of overfitting
Overfitting to a training or validation set, and consequently performing poorly on a test set, is obviously undesirable. However, if only a certain subset of a larger distribution is relevant to the desired performance, overfitting to specifically that subset can be a good thing. By overfitting here we mean fitting that subset instead of generalizing over the entire data distribution. Further training could still lead to overfitting to the specific training or validation set within that subset, which still has to be prevented.
This can be visualized as follows:
Let’s say we don’t have enough data to build a model that can handle the entire indoor scenery distribution, but we can build two models that each perform well on the area within their pink outlines. By fine-tuning our model, we can make sure it performs well on Otemachi biru pictures whenever that is the desired output. We thereby ‘overfit’ to that specific subset, and our performance on the SUNRGB-D distribution worsens.
If performance on both subsets is desired, this overfitting is not good, and some different learning approach has to be taken.
Combining datasets with different numbers of classes
More a trick than an insight: if we require 1 class (floor) but start training with 37 classes (SUNRGB-D dataset), we are performing unnecessary learning. However, reducing the 37 classes to a single floor-vs-non-floor split might also reduce performance. Our hypothesis is that jointly training on ceiling, wall, etc. segmentation helps the model learn a more general representation of geometric information, boosting its floor class performance. This is a common concept in multi-task learning, where each task is thought to increase the other tasks’ performance.
As such, we reduce the class dimension from 37 to 5 classes, which coincidentally also allows us to include the Hedau dataset (and more data is better).
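As a minimal sketch of this class reduction, a per-pixel lookup table can collapse the 37 fine-grained label ids into 5 coarse ones. Note that the specific class ids and groupings below are illustrative assumptions, not the actual SUNRGB-D mapping we used:

```python
import numpy as np

# Hypothetical mapping from 37 fine class ids to 5 coarse classes.
# Coarse ids (assumed for illustration): 0 = other, 1 = floor,
# 2 = wall, 3 = ceiling, 4 = person.
NUM_FINE = 37
fine_to_coarse = np.zeros(NUM_FINE, dtype=np.int64)  # default: other
fine_to_coarse[2] = 1   # e.g. a fine "floor" id -> coarse floor
fine_to_coarse[1] = 2   # e.g. a fine "wall" id -> coarse wall
fine_to_coarse[22] = 3  # e.g. a fine "ceiling" id -> coarse ceiling

def remap(label_map: np.ndarray) -> np.ndarray:
    """Vectorised remap of an (H, W) integer label map via fancy indexing."""
    return fine_to_coarse[label_map]
```

Because the lookup is a single indexing operation, the remap can be applied on the fly in the data loader without a per-pixel loop.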
Transfer learning loss function
Another problem with transfer learning from multiple classes to a single class is that the Softmax cross-entropy loss function used in pre-training can’t be applied during fine-tuning, as we don’t have labels for the other 4 classes. (We don’t delete the last layer, but instead keep the 5-class-dimensional output of the model intact.)
For pixels whose ground truth is floor (the ‘True Positive’ and ‘False Negative’ cases), a Softmax can still be applied, since the correct class out of the 5 is known. In the other two cases (‘True Negative’ and ‘False Positive’), where we only know that the pixel is not floor, a Sigmoid on the floor channel has to be used instead.
We reasoned that it is acceptable, and on balance better, to lose some information by always applying a sigmoid, rather than to lose time on a per-pixel check of which cases allow a Softmax.
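The always-sigmoid approach can be sketched as follows: keep the full 5-channel output, but supervise only the floor channel with a per-pixel binary cross-entropy. This is a simplified numpy illustration (the channel index and the unsmoothed BCE are assumptions for clarity; a framework loss such as a logits-based BCE would be used in practice for numerical stability):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def floor_bce_loss(logits, floor_mask, floor_channel=0):
    """Binary cross-entropy on the floor channel only.

    logits:     (C, H, W) raw model outputs, C = 5 coarse classes
    floor_mask: (H, W) binary ground truth, 1 where the pixel is floor

    The other 4 channels receive no supervision during fine-tuning,
    so the pre-training softmax over all 5 classes is replaced by a
    per-pixel sigmoid on the floor logit alone.
    """
    p = sigmoid(logits[floor_channel])
    eps = 1e-7  # avoid log(0)
    bce = -(floor_mask * np.log(p + eps)
            + (1.0 - floor_mask) * np.log(1.0 - p + eps))
    return bce.mean()
```

Because every pixel goes through the same sigmoid branch, the loss stays fully vectorised and no per-pixel case distinction is needed.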