Recent researches on pixel-wise semantic segmentation use deep neural networks to improve accuracy and speed of these networks in order to increase the efficiency in practical applications such as automatic driving. These approaches have used deep architecture to predict pixel tags, but the obtained results seem to be undesirable. The reason for these unacceptable results is mainly due to the existence of max pooling operators, which reduces the resolution of the feature maps. In this paper, we present a convolutional neural network composed of encoder-decoder segments based on successful SegNet network. The encoder section has a depth of 2, which in the first part has 5 convolutional layers, in which each layer has 64 filters with dimensions of 3×3. In the decoding section, the dimensions of the decoding filters are adjusted according to the convolutions used at each step of the encoding. So, at each step, 64 filters with the size of 3×3 are used for coding where the weights of these filters are adjusted by network training and adapted to the educational data. Due to having the low depth of 2, and the low number of parameters in proposed network, the speed and the accuracy improve compared to the popular networks such as SegNet and DeepLab. For the CamVid dataset, after a total of 60,000 iterations, we obtain the 91% for global accuracy, which indicates improvements in the efficiency of proposed method.
Manuscript profile