[Paper Review] Spatio-temporal filter adaptive network for video deblurring

4 minute read

Spatio-temporal filter adaptive network for video deblurring

Zhou, Shangchen, et al. “Spatio-temporal filter adaptive network for video deblurring.” Proceedings of the IEEE International Conference on Computer Vision. 2019.

Abstract

video deblurring challenge : camera 흔들림, object motions, depth variations 등등에 의한 spatially variant blur
기존 방식들은 optical flow를 이용하여 연속적인 frame 사이의 alignment를 수행하거나 blur kernels 예측함

문제점 : 예측된 optical flow가 정확하지 않을 경우, artifacts를 생성하거나 효율적으로 blur를 제거할 수 없음

분리된 optical flow 예측에서의 제한을 극복하기 위한 Spatio-Temporal Filter Adaptive Network(STFAN) 제안

통합된 framework에서 alignment와 deblurring 수행

STFAN

Input : 이전 frame의 blurry img, restored img 그리고 현재 frame의 blurry img (총 3개)

동적으로 alignment와 deblurring을 수행하기 위한 spatially adaptive filters를 생성

새로운 Filter Adaptive Convolutional(FAC) layer 제안

alignment : 현재 frame과 이전 frame(deblurred features) 사이 alignment 수행

deblurring : 현재 frame features에서 spatially variant blur를 제거

하나의 reconstruction network를 개발

선명한 frames을 얻기 위해 2개의 transformed features를 융합한 것을 input으로 받음 (align feature, deblur feature)

제안된 방식은 SOTA 성능 달성 (정확도, 속도, model size)

Introduction

Spatio-Temporal Filter Adaptive Network(STFAN) for video deblurring

element-wise filter adaptive convolutional(FAC) layer 제안

dynamic filter networks에서 영감 : input에 condition된 filter를 생성 [11]

memoryblock

기존 dynamic filter networks와 다르게 FAC layer는 생성된 spatially variant filters를 down-sampled features에 적용함

하나의 작은 filter size를 사용하여 large receptive field를 얻을 수 있게 함

feature channel 별로 동적으로 filter를 생성하기 때문에 강한 capability와 flexibility를 가짐

하나의 통합된 network에서 2개의 element-wise filter adaptive convolution processes로 alignment와 deblurring filter 생성

이전 frame의 blurry image, restored image와 현재 frame의 blurry image가 주어지면, 동적으로 feature transformation을 위한 alignment filter와 deblurring filter를 생성

기존의 방식과 다르게 deblurring filters를 예측하기 위해 풍부한 input을 취함

3개의 image와 motion 정보(alignment filters를 통해 얻어진 두 인접 frame 사이의 motion 정보)

FAC layer의 효과

명시적으로 optical flow를 예측하고 image를 warping 시키는 과정 없이, 다른 time steps에서 얻어진 features를 적응적으로 alignment 수행함

alignment 정확도의 tolerance

feature domain에서 deblurring과 함께 spatially variant blur를 더 잘 다룰 수 있음

memoryblock

The main contributions

1) FAC layer 제안 : 생성된 element-wise filters를 feature transformation에 적용

feature domain에서 alignment와 deblurring task 수행

2) video deblurring을 위한 새로운 spatio-temporal filter adaptive network(STFAN) 제안

명시적인 motion estimation(optical flow 예측) 없이 하나의 통합된 framework로 alignment와 deblurring 수행하고, 2개의 spatially variant convolution process 형성

3) 양적, 질적 모두 benchmark dataset에서 STFAN을 평가하였고, 정확도, 속도, model size 모든 면에서 SOTA임을 보여줌

Proposed Algorithm

3.1 Overview

memoryblock

일반적인 CNN 기반 video deblurring 방식들과 다르게, frame-recurrent 방식 제안 (이전의 frame과 현재 input의 정보를 필요로 함)

기존 CNN 방식 : 중간 frame을 shape한 image로 복원시키기 위해, input으로 연속적인 burry frames(3 or 5)을 취함

반복적인 특성으로 인해, 계산비용의 증가 없이 많은 수의 이전 frames의 정보를 이용할 수 있음
STFAN은 3개의 image로부터 alignment와 deblurring을 위한 filter를 생성

3개의 image : 이전 frame의 blurry image와 restored image 그리고 현재 frame의 blurry image

FAC layer를 사용하여 이전 time step에서의 deblurred features와 현재 frame feature의 alignment를 수행하고, 현재 blurry image에서 추출된 features로부터의 blur를 제거
최종적으로, reconstruction network가 2개의 transformed features를 융합시키면서 shape한 image로 복원시킴

3.2 Filter Adaptive Convolutional Layer

memoryblock

the filter adaptive convolutional(FAC) layer를 제안 : 생성된 element-wise convolutional filters를 features에 적용

Kernel Prediction Network에서 영감을 얻음 (RGB channel에 대해 같은 prediction)

공간적으로 변형된 task에 대해 더 유능하고 유연하게 대처하기 위해, 각 channel에 대해 다른 filter를 생성함
생성된 filter F의 dimension : h x w x ck2 (이론적으로는 5-dimension : h x w x c x k x k)

이를 5-dimension으로 reshaping하여 feature에 적용시킴

memoryblock

하나의 large receptive field는 large motions과 blurs를 다루는데 필수적임
기존의 방식과 대조적으로 제안된 network는 down-sampled features에 FAC layer를 적용하기 때문에 큰 filter size를 필요로 하지 않음

실험을 통해 k=5로 하였을 때 deblurring task에 충분한 성능을 내는 것을 확인함

3.3 Network Architecture

Feature Extraction Network

blurry image B에서 features E를 추출함

총 3개의 convolutional blocks으로 이루어져 있고, 각 convolution block은 stride=2인 하나의 convolutional layer와 LeakyReLU를 갖는 2개의 residual blocks으로 구성

추출된 features는 STFAN의 입력으로 들어감

Spatio-Temporal Filter Adaptive Network

3개의 module : encoder, alignment filter generator, deblurring filter generator

2개의 filter generator는 kernel size가 3인 하나의 convolution layer와 2개의 residual blocks으로 구성되어 있고, 그 뒤에 channel 확장을 위한 1x1 convolution 적용

1) The encoder : feature T 추출

kernel size가 3인 3개의 convolutional blocks으로 구성되어 있고, 각 block은 stride=2인 하나의 convolutional layer와 2개의 residual blocks으로 구성

2) The alignment filter generator : alignment를 위한 adaptive filters를 예측

input : encoder에서 추출된 feature T

생성된 filter는 풍부한 motion 정보를 포함하고 있고, dynamic scene에서 불균일한 blur의 처리를 도움

memoryblock

3) The deblurring filter generator : deblurring을 수행하기 위한 spatially variant filters를 생성

input : feature T와 alignment filters

memoryblock

2개의 생성된 filters와 함께, 2개의 FAC layers는 이전 time step에서의 deblurred features H_t-1와 현재 frame 사이 alignment를 수행하고 추출된 feature E에서의 blur를 제거함

다음으로 2개의 transformed features를 concatenation(C) 시키고, reconstruction network를 통해 shape image 복원

deblurred information H_t는 다음 iteration으로 보내짐

Reconstruction Network

input : STFAN에서 융합된 features(C_t)

scale convolutional block으로 구성

각 block은 하나의 deconvolution layer와 2개의 residual blocks으로 이루어짐

3.4 Loss function

1) Mean squared error(MSE) loss : restored frame R과 그에 해당하는 ground truth S 사이의 차이를 측정

memoryblock

2) Perceptual loss : R과 S의 VGG features 사이의 Euclidean distance 측정

memoryblock

The final loss function

memoryblock

Share on

Twitter Facebook LinkedIn

Hyeyoon Kang

[Paper Review] Spatio-temporal filter adaptive network for video deblurring

Spatio-temporal filter adaptive network for video deblurring

Abstract

Introduction

Spatio-Temporal Filter Adaptive Network(STFAN) for video deblurring

Proposed Algorithm

3.1 Overview

3.2 Filter Adaptive Convolutional Layer

3.3 Network Architecture

3.4 Loss function

Share on

Leave a comment

You may also enjoy

[Paper Review] Language models are few-shot learners

[Paper Review] Language models are unsupervised multitask learners

[Paper Review] Exploring the limits of transfer learning with a unified text-to-text transformer

[Paper Review] Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension