CNN-Based Single-Image Super-Resolution: A Comparative Study

The following article is a brief report on my Bachelor’s project, a comparative study regarding CNN-Based Single-Image Super-Resolution (SISR) techniques. We’ll begin with a brief look into the first applications of Deep Learning in SISR, followed by a discussion regarding the state-of-the-art CNN-based models for this task. We’ll be comparing the performance of the models on an open-source dataset of satellite images concerning their output quality, speed, and resource demands. You can find the scripts for our experiments as well as the trained super-resolution models here.

Learning in Single-Image Super-Resolution

When the scale factor between the HR image and its LR counterpart surpasses 2, this curve-fitting process results in very smooth images, devoid of sharp edges and sometimes, with “artifacts.” This is because an interpolation technique is not, in fact, adding any new information to the signal. This can be changed with the aid of learning-based techniques. The idea is that if a neural network is provided with enough LR-HR pairs, it will be able to recognize the obscured entities in the LR images and reconstruct them based on the samples it’s seen during training. In other terms, the neural network will learn to “hallucinate” the details lost in the LR image. Deep convolutional neural networks are an obvious candidate for the job, given their outstanding success in image processing problems.

Early Attempts

Figure 1. The network structures of SRCNN and FSRCNN. The SRCNN network aims to learn an end-to-end mapping from an upsampled version of the input LR image (using the bicubic algorithm) to the HR target. Instead, FSRCNN introduces a deconvolution layer as the last layer of the network while replacing the non-linear mapping section with a three-step process: shrinking, mapping, and expanding. The authors refer to this as an “hour-glass” structure.

Super-resolution images generated by SRCNN and FSRCNN achieved higher Peak Signal-to-Noise Ratio (PSNR) values than the bicubic interpolation algorithm; e.g., in the famous Set5 dataset, the average PSNR for ×2 super-resolution is increased by around 4dB. Nevertheless, as the scale factor increases, the margin between the two grows smaller. Furthermore, studies have debunked the reliability of the PSNR metric [9] since then, and more recent publications also tend to investigate a more recently-introduced metric known as Structural Similarity (SSIM) [10]. We will be investigating both of these metrics in our study.

Methods

Residual Dense Network (RDN)

Figure 2. Residual Dense Blocks, the main building block of RDN.

The overall structure of the network is rather simple. The architecture of RDN is rather similar to FSRCNN, in the sense that it begins with convolution layers for feature extraction and ends with an upsampling operator (with learnable parameters) that maps the extracted features to the HR space. The upsampling operator in this network is inspired by [13]. In between the two, D RDB blocks are stacked on top of one another, the outputs of which are then concatenated and passed through a convolution layer with 1×1 filters.

Figure 3. The architecture of RDN.

The authors use the L₁ loss function and the Adam optimizer for training the network. To stabilize the training procedure, the learning rate is decreased exponentially every certain number of epochs. Later works have also adopted this setting. Despite its complicated architecture, RDN is the fastest one to train and is also among the top two with regard to inference time. Besides, RDN is one of the few models that I could train and evaluate on my notebook’s NVIDIA GTX 1660 Ti with only 6GB of VRAM. Yet, RDN is the second largest model with respect to learnable parameter counts, which speaks volumes about its efficiency.

Residual Channel Attention Network (RCAN)

Figure 4. Channel Attention mechanism introduced in [4].

At the start of the CA mechanism, a global average pooling (GP) is applied to the input, transforming it into a 1×1×C tensor, which is then passed through a convolution layer (D), a deconvolution layer (U), and the logistic sigmoid activation function (f).

Figure 5. The structure of RCABs.

This mechanism is then embedded in Residual Channel Attention Blocks (RCABs; Figure 5), which are the primary building block of RCAN, as illustrated in Figure 6.

Similar to RDN, the upscaling module in RCAN is inspired by the ideas introduced in [13].

Figure 6. The architecture of RCAN.

The computation cost of the attention mechanism in this network significantly increases the model's resource demand, making it one of the hardest ones to work with on this list. Another noteworthy point regarding this model is that RCAN is the only model on this list preferred using the Stochastic Gradient Descent to the Adam optimizer. However, the authors still use learning rate scheduling with similar settings for more stable training.

Super-Resolution Feedback Network (SRFBN)

Figure 7. The structure of FB.

This structure is then put into the architecture illustrated in Figure 8. An important feature of the SRFBN model, apparent from the figure, is that rather than learning an end-to-end mapping from the LR space to the HR space, the convolutional layers are tasked with predicting the residual error between the HR image and a copy of the LR input which has been upsampled by the bicubic interpolation algorithm. This technique, which is referred to as residual learning, has been studied in some previous works (earliest being [14]) and has shown to considerably speed up the model's convergence, reaching satisfactory quality in far fewer epochs than end-to-end models.

Figure 8. The structure of SRFBN (to the left) and unfolded (to the right). Layers labeled as LRFB are the LR Feature extraction Block, and the couple labeled as RB is the Reconstruction Block, which maps the extracted features into an RGB image.

Thanks to the feedback mechanism, the network is able to achieve qualities very close to the other methods but with the least number of learnable parameters. I trained this model on my notebook, but the evaluation script is rather excessive in memory consumption (to avoid disk I/O operations, as far as I understood). Inference through this technique is also rather quick, usually as fast as RDN.

Densely Residual Laplacian Network (DRLN)

Figure 9. The structure of the Laplacian Attention mechanism. The subscripts in the Laplacian Pyramid denote the dilation factor of the respective convolution layers.

The architecture of this model is illustrated in Figure 10. The architecture is similar to that of RCAN, with short residual connections bypassing several consecutive building blocks and one long connection, connecting the features extracted from the LR image to the groups' final output. However, DRLN applies its attention mechanism less frequently than RCAN and instead adds several convolution layers after its smaller building blocks, which they refer to as Dense Residual Laplacian Modules (DRLMs).

Figure 10. The structure of DRLN. In this figure, residual connections are referred to as skip connections.

The authors certainly achieve their goal of pushing the model to its limits and double the number of learnable parameters while also reducing the time needed for both training and inference. However, this complicated model does not improve the quality of the output images (at least for our dataset) and only performs as well as RCAN.

Gated Multiple Feedback Network (GMFN)

Figure 11. The structure of GMFN, GFM, RDB (identical to the one introduced in RDNs), and the reconstruction block (identical to the final layers in the SRFBN model).

The authors' changes made push the quality of the SR images generated by GMFN significantly above SRFBN while also reducing the runtime of inference, with a small penalty to the training time. Moreover, the GMFN model manages to surpass the RDN model in quality with fewer RDBs. In our experiments, the margin between the quality of the interpolated images and the SR images reduces more slowly for GMFN compared to other models, hinting that GMFN is the superior technique in higher scale factors.

Cross-Scale Non-Local Network (CSNLN)

Figure 11. Non-local cross-scale similarities in an image.

The authors of [8] call on an older idea used by traditional methods for SISR, self-similarity, which can be excellently described using the example in Figure 11. In this image, several larger patches throughout the image are similar to the target patch we’re trying to upscale. These patches can serve as a guide in reconstructing the target patch in the SR image.

The authors introduce two new attention mechanisms, designed for exploiting self-similarity. In-Scale Non-Local Attention (ISNLA) is a computationally expensive attention mechanism that can expose self-similarities in the image on the same scale. Cross-Scale Non-Local Attention (CSNLA) is the more convoluted sibling of ISNLA, finding similar patches with different sizes. Figure 12 shows the structure of the CSNLA mechanism.

Figure 12. The structure of the CSNLA mechanism. The upper branch looks for the patches in the LR image corresponding to patches in the HR image, while the branches within the green box find the similarities.

The CSNLN model applies these attention mechanisms to features extracted by convolution layers and then combines them with the features themselves using an approach they refer to as mutual-projected fusion. This procedure is shown in Figure 13.

Figure 13. The mutual-projected fusion. The upsample operation, as well as the downsample operation, are implemented using convolutional layers. The lower input denoted as L is the features extracted by the convolution layers processed by the attention mechanisms. s refers to the scale factor for the SR transformation.

The attention mechanisms and the convolution layers previously discussed are first upscaled via a deconvolution layer, followed by a couple of ReLU-activated convolutional layers. These building blocks form a Self-Exemplars Mining (SEM) cell, illustrated in Figure 14, along with the overall architecture of CSNLN. CSNLN uses feedback connections similar to SRFBN, putting its main building block within one.

Figure 14. The unfolded structure of CSNLN. Features extracted with each iteration are collected and concatenated into the same tensor.

The convoluted attention mechanisms introduced in this work can give it an edge in reconstruction quality in lower scale factors. However, this slight edge comes at an incredible cost, increasing both the training time and inference time of the model by order of magnitude, even though the network itself is a moderate-size network regarding the number of learnable parameters has.

Performance Evaluation and Comparison

All models were trained and tested with the hyperparameters and optimization settings suggested in their source code repositories on an NVIDIA V100 Volta with 32GB of VRAM. Note that the implementations are in various versions of PyTorch and use different versions of the CUDA library, which might have a small effect on their speed. The reported inference times are the average values recorded when reconstructing the entire dataset. We use the implementation of the bicubic interpolation algorithm provided in the Scikit-Image library, which might not be the most efficient implementation. Still, the same can be said for the CNN-based models.

Table 1 shows the results of our experiments.

Table 1. Comparison of the CNN-Based Single-Image Super-Resolution techniques, regarding reconstruction quality, no. of learnable parameters, training time, and inference time. The best value for each metric has been bolded. VDSR, SRGAN, and EEGAN are not discussed in this article. Refer to [14], [16], and [15], respectively.

To summarize, it seems that the idea of feedback networks can greatly help out SISR models; these networks can keep up with models with far more learnable parameters and deliver the same quality of reconstruction. Furthermore, it seems that while attention mechanisms seem to be the superior technique with small scale factors, where we want to recover minor details, techniques without one often surpass these techniques in higher scale factors, rendering their computational cost unreasonable to bear. If you’d ask us, we’d go with RDN or GMFN, depending on whether we can endure a little longer training for faster inference.

[1] Dong, Chao, Chen Change Loy, Kaiming He, and Xiaoou Tang. “Image super-resolution using deep convolutional networks.” IEEE transactions on pattern analysis and machine intelligence 38, no. 2 (2015): 295–307.

[2] Dong, Chao, Chen Change Loy, and Xiaoou Tang. “Accelerating the super-resolution convolutional neural network.” In European conference on computer vision, pp. 391–407. Springer, Cham, 2016.

[3] Zhang, Yulun, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. “Residual dense network for image super-resolution.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2472–2481. 2018.

[4] Zhang, Yulun, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. “Image super-resolution using very deep residual channel attention networks.” In Proceedings of the European Conference on Computer Vision (ECCV), pp. 286–301. 2018.

[5] Li, Zhen, Jinglei Yang, Zheng Liu, Xiaomin Yang, Gwanggil Jeon, and Wei Wu. “Feedback network for image super-resolution.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3867–3876. 2019.

[6] Anwar, Saeed, and Nick Barnes. “Densely residual laplacian super-resolution.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).

[7] Li, Qilei, Zhen Li, Lu Lu, Gwanggil Jeon, Kai Liu, and Xiaomin Yang. “Gated multiple feedback network for image super-resolution.” arXiv preprint arXiv:1907.04253 (2019).

[8] Mei, Yiqun, Yuchen Fan, Yuqian Zhou, Lichao Huang, Thomas S. Huang, and Honghui Shi. “Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5690–5699. 2020.

[9] Wang, Zhou, and Alan C. Bovik. “Mean squared error: Love it or leave it? A new look at signal fidelity measures.” IEEE signal processing magazine 26, no. 1 (2009): 98–117.

[10] Wang, Zhou, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. “Image quality assessment: from error visibility to structural similarity.” IEEE transactions on image processing 13, no. 4 (2004): 600–612.

[11] Lim, Bee, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. “Enhanced deep residual networks for single image super-resolution.” In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. 2017.

[13] Shi, Wenzhe, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. 2016.

[14] Kim, Jiwon, Jung Kwon Lee, and Kyoung Mu Lee. “Accurate image super-resolution using very deep convolutional networks.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654. 2016.

[15] Jiang, Kui, Zhongyuan Wang, Peng Yi, Guangcheng Wang, Tao Lu, and Junjun Jiang. “Edge-enhanced GAN for remote sensing image superresolution.” IEEE Transactions on Geoscience and Remote Sensing 57, no. 8 (2019): 5799–5812.

[16] Ledig, Christian, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken et al. “Photo-realistic single image super-resolution using a generative adversarial network.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. 2017.

Full-Stack Web Developer By Day, and a Deep Learning Researcher By Night!