[Paper Review] Video inpainting by jointly learning temporal structure and spatial details

3 minute read

Video inpainting by jointly learning temporal structure and spatial details

Wang, Chuan, et al. “Video inpainting by jointly learning temporal structure and spatial details.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.

Abstract

video frames의 잃어버린 부분을 복원하기 위한 새로운 data-driven video inpainting method 제시
새로운 deep learning 구조로 2가지 sub-networks로 이루어짐

The temporal structure inference network

3D fully convolutional 구조로 이루어져 있고, 높은 계산 비용을 감안하여 low-resolution video volume을 완성하는 것을 학습함

The spatial detail recovering network

low-resolution 결과를 통해 temporal guidance를 제공받고, 2D fully convolutional network를 사용하여 image 기반 inpainting을 수행

original resolution으로 복원된 video frames을 만들어냄

2-step network 구조는 각 frame의 spatial quality과 frame에 걸친 temporal coherence를 모두 보장

2개의 sub-networks를 함께 end-to-end 방식으로 학습을 진행

3개의 datasets을 통해 양적, 질적 평가를 진행하였고, 이전의 learning based methods의 성능을 능가

Introduction

holes이 존재하는 image나 video가 주어지면, inpainting 방식들은 보기에 자연스러운 결과를 생성하기 위해 잃어버린 video content를 복원하기 위해 노력함

object removal에 의해 생성된 holes

inpainting 기술에 요구되는 2가지 조건

1) 잃어버린 부분에 생성된 content는 주변 content와 의미론적으로 정확해야 함 (semantically correct)

2) 원래 holes을 알아볼 수 없도록 매끄럽게 채워야 함

본 논문은 image inpainting에 temporal dimension을 추가한 video inpainting에 초점을 맞춤

1) 잃어버린 video content를 복원시키는 것은 각 frame의 spatial context 뿐만 아니라 frames 간 motion context도 필요로 함

2) output video는 high spatio-temporal consistency를 유지해야 함 (global context-level, local image-feature-level)

2D를 3D로 확장하기 위한 많은 시도가 있었으나, 이는 challenging하고 한계점이 존재

새로운 end-to-end deep learning 구조를 제시

memoryblock

network는 temporal structure prediction sub-network와 spatial detail recovering sub-network로 구성

1) Temporal structure prediction sub-network

3D volume으로 video를 처리하고, input으로 downsampled video를 취함

Encoder-Decoder 구조로 3D CNN을 이용하여 holes을 채움

output volume을 temporal structure guidance로 사용 (spatial details은 부족하지만 motion 정보를 담고 있기 때문)

2) Spatial detail recovering sub-network

input : original video, temporal structure guidance

완성된 비디오 프레임을 원래 resolution으로 생성함

global, local l1 consistency losses를 갖는 2D Encoder-Decoder 구조

2개의 sub-network는 공동으로 학습되고 서로 도움을 줌

temporal structure guidance는 최종 video의 temporal smoothness와 context consistency를 향상시킴

또한, spatial detail recovering network가 backpropagation 과정을 통해 첫번째 network로 들어가면서 정확도를 향상시킴

Contributions

1) video completion 문제를 해결하기 위한 최초의 deep neural networks 제안

기존의 방식들과 비교하였을 때, 제안한 알고리즘은 복잡한 appearances와 missing region이 큰 video를 더 잘 다룰 수 있음

2) 새로운 deep learning 구조를 디자인

temporal structure 예측을 위한 3D CNN과 spatial detail을 복원시키기 위한 2D CNN으로 구성

3) 2개의 sub-networks를 공동으로 학습

전체 system의 성능 향상

Algorithm

input : incomplete video V_in(FxHxW), mask video M

output : complete video V_out

V_in, M, V_out 모두 동일한 사이즈

training 단계에서 V_in은 complete video에 random holes을 발생시켜서 얻어짐

3D completion network (3DCN)

3D CNN을 이용하여 V_in과 M의 down-sampled version으로부터 temporal structure를 예측

3D-2D combined completion network (CombCN)

2D CNN을 frame 별로 적용시킴

input : V_in, M (incomplete, high-resolution) complete frame I (low-resolution, 3DCN의 output)

paper setting : F=32, H=128, W=128, r=2

memoryblock

Temporal structure inference by 3DCN

3D CNN을 global하게 video inpainting에 적용하였고, 계산 비용 문제로 인해 down-sampled version을 입력으로 취함
3D completion network는 inpainted video V를 생성하고, 이는 개별 frame 별 details은 부족하지만 원래 video의 temporal structure를 잘 담고 있음
encoder-decoder 구조이고 총 12 layer로 구성되어 있음
입력으로 incomplete video와 mask가 주어지면, 먼저 4개의 strided convolution layers를 통해 latent space로 encoding

temporal-spatial structure를 capturing

다음으로, large perception field에서 spatial-temporal 정보를 capture하기 위한 3개의 dilated convolution layers를 거침

rate는 2,4,8로 설정

최종적으로 3개의 convolutional layers와 2개의 fractionally-strided convolutional layers로 들어감

missing part를 채움

전체 픽셀을 고려하기 위해 max-pooling과 upsampling layer를 사용하지 않고 stride가 2인 3x3 convolution layer를 사용
마지막 layer를 제외한 모든 convolutional layer는 BN과 ReLU를 수행하고, paddings을 통해 input과 output의 size를 맞춰줌
skip-connections을 적용하여 encoder와 decoder 간 feature mixture를 용이하게 함

Training

binary mask M은 filling region이면 1, 아니면 0으로 설정
l1 norm을 이용하여 loss 최소화

memoryblock

Spatial details inference by CombCN

Share on

Twitter Facebook LinkedIn

Hyeyoon Kang

[Paper Review] Video inpainting by jointly learning temporal structure and spatial details

Video inpainting by jointly learning temporal structure and spatial details

Abstract

Introduction

Algorithm

Temporal structure inference by 3DCN

Training

Spatial details inference by CombCN

Share on

Leave a comment

You may also enjoy

[Paper Review] Language models are few-shot learners

[Paper Review] Language models are unsupervised multitask learners

[Paper Review] Exploring the limits of transfer learning with a unified text-to-text transformer

[Paper Review] Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension