[Paper Review] Selective Refinement Network for High Performance Face Detection

3 minute read

Chi, Cheng, et al. “Selective refinement network for high performance face detection.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.

Abstract

Selective Refinement Network(SRN) : 새로운 two-step classification과 regression 연산을 선택적으로 수행 (single-shot face detector)

false positives 감소, location 정확도 향상

The Selective Two-step Classification(STC) module : low-level detection layers에서 많은 simple negative anchors를 걸러냄

다음 classifier의 search space를 줄이기 위해

The Selective Two-step Regression(STR) module : high-level detection layers에서 anchors의 locations과 sizes를 조정

다음 regressor에 좋은 initialization을 제공하기 위해

Receptive Field Enhancement(RFE) block : 더 다양한 receptive field 제공

몇몇 극단적인 경우(extreme poses)에서 faces를 더 잘 예측할 수 있도록 도움

SRN detector는 SOTA 달성 : AFW, PASCAL face, FDDB, WIDER FACE datasets

Introduction

face detection 성능을 올리기 위해 다루어야 할 2가지 문제

1) recall efficiency : high recall rate에서 false positive의 수를 줄일 필요가 있음

memoryblock

RetinaNet의 그래프를 보면, recall rate가 90% 일 때, precision이 약 50% (low recall efficiency)

기존의 방식들은 high recall rate에 중점을 두고, 그로 인한 다량의 false positives 문제를 무시함

2) location accuracy : bounding box의 정확도가 향상될 필요가 있음

memoryblock

IoU threshold가 커질수록 AP의 값이 상당히 떨어짐

bounding box 정확도 향상을 위해 multi-step regression이 적용되었으나, 이는 face detection에서 역효과를 냄

Main contributions

1) STC module 제안 : classification search space를 줄이기 위해 low-level layers에서 많은 simple negative samples을 걸러냄

2) STR module 제안 : 다음 regressor에게 좋은 initialization을 제공하기 위해 high-level layers에서 anchors의 위치와 사이즈 조정

3) RFE module 도입 : extreme-pose-faces를 예측하기 위해 더 다양한 receptive fields를 제공

4) AFW, PASCAL face, FDDB, WIDER FACE datasets에서 SOTA 달성

Selective Refinement Network

memoryblock

Backbone : ResNet-50

6-level feature pyramid 구조

C2, C3, C4, C5 : 4개의 residual block으로부터 추출한 feature maps

C5, C6 : 2개의 단순한 down-sample 3x3 convolution layers에서 추출한 feature maps

P2, P3, P4, P5는 C2, C3, C4, C5와 연결된 구조 : bottom-top and top-down (FPN 구조)

P6, P7은 단순한 down-sample 3x3 convolution layers에서 추출한 feature maps

Dedicated Modules

The STC module : C2, C3, C4와 P2, P3, P4를 선택하여 수행

The STR module : C5, C6, C7와 P5, P6, P7를 선택하여 수행

The RFE module : classification과 location을 예측하는데 사용되는 features의 receptive field를 풍부하게 함

Anchor Design

모든 pyramid level에 2개의 anchors scale과 1개의 aspect ratio 사용

scale range : 8-362 pixels을 커버함

Loss function

deep architecture의 끝에 hybrid loss를 추가

focal loss와 smooth L1 loss의 장점을 둘 다 얻음

hard training examples에 더 focus를 두고 학습을 진행하며, 더 좋은 regression 결과를 학습

Selective Two-Step Classification

기존의 문제점 : anchor-based detector에서 작은 object를 detection하기 위해서는 작은 anchor를 image 전체에 tiling -> few positive, plenty of negative (class imbalance)

anchor-based face detector에서 classifier의 search space를 줄이기 위한 방식이 필요함

false positives를 줄이기 위한 two-step classification 방식

Selective Two-Step Classification

상위 pyramid level(P5,P6,P7)에서는 two-step classification이 불필요함

anchor의 수가 많지 않고 classification loss가 비교적 쉽기 때문에, 적용될 필요가 없고 적용시 계산비용을 늘어나게 함

하위 pyramid level(P2,P3,P4)에서는 two-step classification이 필요함

약 88.9% samples을 차지하며, 적절한 features(positive samples)가 부족

class imbalance 문제를 완화시키고 search space를 줄임

memoryblock

STC module은 positive/negative sample ratio를 증가시킴 (약 38배)

samples의 완전한 사용을 위해 각 step에 focal loss 사용

각 step에서 classification module 공유 (same task)

memoryblock

Selective Two-Step Regression

one-step regression의 문제점 : 최근 one-stage detector는 one-step regression 방식을 이용하는데, 이는 어려운 장면에서 부정확함

multi-step regression의 문제점 : cascade 구조에서 multi-step regression 방식을 이용하는 것은 오히려 역효과를 냄

Selective Two-Step Regression

low pyramid levels에서 two-step regression을 이용하면 성능이 저하됨

memoryblock

1) small anchors to perform two-step regression (너무 많은 small anchor)

2) 학습과정에서 network가 어려운 regression task에 집중하면서 큰 regression loss의 원인이 되고, 더 중요한 classification task를 방해함

3개의 higher pyramid levels에서 two-step regression 진행

3개의 higher pyramid levels에서 large face의 features를 충분히 활용하여 더 정확한 bounding boxes의 위치를 찾고, 3개의 lower pyramid levels이 classification task에 더 집중할 수 있도록 만듦

memoryblock

the smooth L1 loss 사용

Receptive Field Enhancement

많은 detection networks는 ResNet이나 VGGNet을 feature extraction module로 사용함

문제점 : 하나의 receptive field를 갖기 때문에, 다른 비율을 가진 object를 찾는 것이 힘듦

Receptive Field Enhancement (RFE) : classes와 위치를 예측하기 전에 feature의 receptive field를 다양하게 만들어 줌

memoryblock

Share on

Twitter Facebook LinkedIn

Hyeyoon Kang

[Paper Review] Selective Refinement Network for High Performance Face Detection

Selective Refinement Network for High Performance Face Detection

Share on

Leave a comment

You may also enjoy

[Paper Review] Language models are few-shot learners

[Paper Review] Language models are unsupervised multitask learners

[Paper Review] Exploring the limits of transfer learning with a unified text-to-text transformer

[Paper Review] Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension