[Paper Review] Fast spatio-temporal residual network for video super-resolution

2 minute read

Fast spatio-temporal residual network for video super-resolution

Li, Sheng, et al. “Fast spatio-temporal residual network for video super-resolution.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

Abstract

3-dimensional(3D) convolution : video에서 spatial 정보와 temporal 정보를 동시에 이용하기 위해 사용하는 일반적인 방식

계산 복잡도가 상당히 증가하면서 SR model의 depth에 제한이 생기고, 그로 인해 성능이 약화됨

video SR task에 3D convolutions을 적용한 fast spatio-temporal residual network(FSTRN) 제안

낮은 계산 비용을 유지하면서 성능을 향상

fast spatio-temporal residual block(FRB) 제안

3D filter를 2개의 3D filter로 분리하면서 상당히 낮은 dimensions을 가짐

cross-space residual learning 고안

LR space를 HR space에 직접 연결시켜 feature fusion과 up-scaling parts의 계산적인 부담을 줄임

광범위한 실험을 통해 SOTA 입증

Introduction

deep learning에서 video SR을 다루는 방식

1) frame 별로 single image SR을 수행

frame을 독립적으로 처리하면서 temporal 정보를 무시함 (temporal inconsistency)

2) temporal 정보를 추출하기 위해 temporal fusion 기술을 활용 (ex, motion compensation)

수동적으로 고안된 구조를 필요로 하고, 계산 비용을 많이 잡아먹음

3) 3-dimensional (3D) filters를 사용 (2D conv -> 3D conv)

너무 많은 parameter와 상당한 계산 복잡도를 가져옴 -> model의 depth에 제한이 생기면서 성능 약화

input LR video와 복원할 HR video는 상당히 비슷하기 때문에 residual connection이 SR network에서 흔하게 사용됨

기존 residual connection 적용 방식의 2가지 문제점

1) HR space에서 residual connection을 사용 (input : interpolated frame)

network의 계산 복잡도를 크게 증가시킴

2) LR space에서 residual connection을 적용

network의 끝 단 feature fusion과 upscaling stage에 부담을 줌 (추가적인 HR supervision 없이 오직 끝 단에서 학습)

위와 같은 문제점을 해결하기 위해, fast spatio-temporal residual network(FSTRN) 제안

The main contributions

1) fast spatio-temporal residual network(FSTRN) 제안

spatial 정보와 temporal 정보를 동시에 이용할 수 있고, temporal consistency를 얻고 spurious flickering artifacts의 문제를 완화시킴

2) fast spatio-temporal residual block(FRB) 제안

3D convolution을 2개의 3D convolution으로 분리시키면서 상당한 dimension 감소

deep neural network 구조로 성능을 향상 시키면서 상당한 계산 비용 감소

3) global residual learning(GRL) : LR space residual learning(LRL), cross-space residual learning(CRL)

상당한 성능 향상

memoryblock

Fast spatio-temporal residual network

3.1 Network structure

memoryblock

1) LFENet

LR videos에서 feature를 추출하기 위해 하나의 C3D 사용

memoryblock

F는 뒤에 LR space global residual learning(LR residual connection)을 위해 사용되며, FRBs의 input으로 들어감

2) FRBs

LFENet output에서 spatio-temporal features를 추출하기 위해 사용

memoryblock

이전 FRB output이 다음 FRB input으로 들어가는 구조

LR space에서의 feature learning을 개선하기 위해 FRBs와 함께 LR space residual learning(LRL)이 수행됨

memoryblock

3) LSRNet

LRL을 통한 효율적인 feature extraction 이후, HR space에서 복원된 video를 얻기 위해 사용

C3D (feature fusion) -> upscaling layer -> C3D (feature map channels tuning)

The network output

memoryblock

3.2 Fast spatio-temporal residual blocks

memoryblock

1) figure 1a

SRResNet에서의 residual block(BN 제거), single SR tasks에서 좋은 향상을 보임

2) figure 1b

residual blocks을 multi-frame에 적용시키기 위해 2D convolution을 3D convolution으로 대체

inflation problem : 많은 계산 비용 증가

3) figure 1c

fast spatio-temporal residual block(FRB) 제안 : C3D -> two step spatio-temporal C3Ds

kxkxk -> 1xkxk(spatial), kx1x1(temporal)

ReLU도 PReLU(negative part에 대한 기울기를 학습)로 변경

memoryblock

계산 비용이 상당히 감소

제한된 컴퓨팅 자원 하에 video SR에 직접 적용하는 더 큰 C3D-based model을 더 나은 성능으로 구축

3.3 Global residual learning

1) LR space residual learning (LRL)

LR space에서 FRBs와 함께 수행하며, 그 뒤에 PReLU와 dropout이 연결되어 있음

dropout layer를 통해 network의 generalization ability 향상

memoryblock

2) Cross-space residual learning (CRL)

단순하게 LR video를 HR space로 직접 SR mapping 수행

CRL이 interpolated LR을 output에 연결시켜 주면서, LSRNet의 계산적인 부담을 덜어줌

memoryblock

mapping function : 계산 비용이 크지 않게 가능한 단순한 함수로 선택

3.4 Network learning

L1 loss를 사용하여 학습을 진행

the Charbonnier penalty function을 사용

memoryblock

Share on

Twitter Facebook LinkedIn

Hyeyoon Kang

[Paper Review] Fast spatio-temporal residual network for video super-resolution

Fast spatio-temporal residual network for video super-resolution

Share on

Leave a comment

You may also enjoy

[Paper Review] Language models are few-shot learners

[Paper Review] Language models are unsupervised multitask learners

[Paper Review] Exploring the limits of transfer learning with a unified text-to-text transformer

[Paper Review] Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension