[Paper Review] Deep blind video decaptioning by temporal aggregation and recurrence

3 minute read

Kim, Dahun, et al. “Deep blind video decaptioning by temporal aggregation and recurrence.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

Abstract

Blind video decaptioning : 자동적으로 text overlays(ex, 자막)를 제거하고, input masks 없이 text로 가려져있던 부분을 그려주는 것
deep learning 기반의 기존 방식 : 하나의 image로 처리하고, 대부분 corrupted pixels(text overlay)의 위치가 알려져 있다고 가정함

본 논문의 목표 : mask 정보없이 video sequences에서 자동적으로 text 제거를 수행하는 것

fast blind video decaptioning을 위한 단순하지만 효과적인 framework 제안

The encoder-decoder model

The encoder

input : multiple source frames

scene dynamics으로부터 visible pixel을 얻을 수 있음

The decoder

encoder로부터 나온 정보(hint)가 합쳐져서(aggregation) decoder의 입력으로 들어감

input frame에서 decoder output으로 residual connection을 적용시킴

network가 오직 corrupted regions에 집중할 수 있도록 함

ECCV chalearn 2018 LAP Inpainting Competition Track2(video decaptioning)에서 1위를 차지
또한, 하나의 recurrent feedback을 적용시키면서 model의 성능을 향상시킴

temporal coherence를 보장할뿐만 아니라 corrupted pixels 위치에 대한 강한 단서를 제공함

양적, 질적인 실험 모두에서 정확하고 temporal consistent video를 얻는 결과를 보여줌 (real time : 50+fps)

Introduction

시각적인 contents로 소비하기 이전에 잃어버리거나 오염된 data를 처리하는 것은 중요한 단계임
image와 video를 처리하는 많은 applications에서 그러한 불완전성(온전하지 못한 data)은 인간과 기계 모두에 대한 시각적 인식을 저하시킴

해결책 : denoising, restoration, super-resolution, inpainting

본 논문은 video decaptioning에 focus를 둠

real-world video restoration scenarios에 직접 적용할 수 있는 video inpainting tasks 중 하나임

다양한 언어의 미디어와 비디오 데이터에서, 텍스트 캡션이나 캡슐화된 광고가 빈번하게 존재함

이러한 text overlays는 visual attention을 떨어뜨리고, frames의 일부분을 가림

video에서 text overlays를 제거하고 가려진 부분을 inpainting 하는 것은 spatio-temporal context에 대한 이해가 필요함
하지만, video sequence를 처리하는 것은 높은 memory를 필요로 하고 추가적인 time dimension으로 인한 시간 복잡도를 유발함

video decaptioning을 처리하는 단순한 방식은 frame 별로 독립적으로 처리하는 것

단점 : video dynamics에서 얻을 수 있는 정보를 활용할 수 없음

대부분의 자막이 있는 videos에서 자막으로 가려진 부분은 인접 frames에서 종종 발견할 수 있음
single frame으로 처리하는 것은 temporal consistency를 고려하지 않기 때문에, 복원된 video에서 연속적인 frames은 자연스럽지 못할 수 있음
video decaptioning에 존재하는 challenge : visual semantics에 독립적으로 자막이 갑자기 사라지거나 바뀌기 때문에, temporal stability를 유지하기 어려움

자동적인 text removal에 존재하는 challenge : corrupted pixels에 대한 binary indicator(inpainting mask)는 사전에 주어지지 않음

대부분의 기존 inpainting 방법[22,31,34]은 보통 mask를 사용할 수 있다고 가정하고 이를 기반으로 다른 image priors를 사용함

video에서 모든 frame에 대해서 pixel masks를 annotation 하는 것은 실용적이지 못하고 제한적임

단순하지만 효과적인 encoder-decoder model을 제안

memoryblock

encoder : 인접 frames으로부터 spatio-temporal context를 융합시킴(aggregation)
decoder : target frame을 reconstruction
residual learning 알고리즘을 적용

corrupted pixels에만 집중할 수 있도록 함

recurrent feedback connection

이전에 생성된 frame에 기반하여 현재 frame을 생성

인접 frames과 이전 output에서의 features는 corrupted regions에서 매우 다르기 때문에, model이 corrupted pixel을 더 잘 찾도록 도와주고, 성능을 향상 시킴
Loss function : gradient reconstruction loss, structural similarity loss
blind video inpainting에 최초로 deep learning을 적용

Contribution

1) 기존 image/video inpainting 방식들과 달리, inpainting masks를 필요로 하지 않는 encoder-decoder model

2) video decaptioning을 위한 효과적이고 강력한 loss function 사용

광범위한 실험을 통해 loss terms의 효과를 경험적으로 확인

3) 다른 방식의 성능을 뛰어넘었고, real-time으로 동작(50+ fps)

ECCV chalearn 2018 LAP Video Decaptioning challenge에서 1위를 차지

4) recurrence mechanism을 도입하여 성능 향상 (visual quality, temporal coherency)

Proposed Method

Video decaptioning : 자막과 노이즈가 있는 frames에서 original frames을 예측하는 것
다양한 인접 frames으로부터 hints를 모으고 target frame을 복원시킴

인접 frames에서 자막이 바뀌거나 object가 움직이면서 가려진 부분의 정보를 얻을 수 있음

recurrent feedback connection 사용

temporal flickering을 감소시키고, corrupted regions을 자동적으로 탐지함

3.1 Residual Learning

frame의 모든 pixel을 직접 예측하는 것은 불필요하게 uncorrupted pixes을 건드릴 수 있음
pixel indicators(inpainting mask) 없이 task를 다루기 위해, residual learning을 통해 학습을 진행

input center frame과 예측된 residual image를 합하여 final output을 생성 (pixel-wise 방식)

network가 명확하게 corrupted pixels에만 집중할 수 있도록 하고 global tone distortion을 방지함

memoryblock

t : frame index / N : temporal radius

3.2 Network Design

hybrid encoder-decoder model

encoder : 3D CNN, 2D CNN으로 구성

decoder : 일반적인 2D CNN 구조

network는 임의의 사이즈를 input으로 받을 수 있도록 fully convolutional 구조
Multiple time steps of our BVDNet framework

memoryblock

Two-stream encoder

인접 frames에서 hints를 얻고, 이전 생성 frame과 일관성을 유지시킴

1) encoder stream

인접 frames으로부터 spatio-temporal features를 직접 capture하는 3D convolutions로 구성

target frame을 복원하기 위해 필요한 the short-term video-level context를 학습 (N=2로 선택)

3D convolution layers를 지나면서 temporal dimension이 1로 감소

center input frame에서 text overlays를 제거하는 것이 목표

2) second stream

입력으로 이전 생성 frame(HxWx1xC)을 받는 2D CNN 구조

현재 생성 frame이 일시적으로 일관성을 유지할 수 있는 reference를 제공

encoded feature와 temporally-pooled one-frame feature를 결합 (element-wise summation)

Bottleneck and temporal-pooling skip connections

1) Bottleneck layer

encoder에는 bottleneck layers(dilated convolutions)가 뒤따름

2) temporal skip connections

3D encoder stream과 decoder 간에 skip connections을 적용시킴

Share on

Twitter Facebook LinkedIn

Hyeyoon Kang

[Paper Review] Deep blind video decaptioning by temporal aggregation and recurrence

Deep blind video decaptioning by temporal aggregation and recurrence

Abstract

The encoder-decoder model

Introduction

Contribution

Proposed Method

3.1 Residual Learning

3.2 Network Design

Share on

Leave a comment

You may also enjoy

[Paper Review] Language models are few-shot learners

[Paper Review] Language models are unsupervised multitask learners

[Paper Review] Exploring the limits of transfer learning with a unified text-to-text transformer

[Paper Review] Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension