[Paper Review] You only look once: Unified, real-time object detection

3 minute read

You only look once: Unified, real-time object detection

Redmon, Joseph, et al. “You only look once: Unified, real-time object detection.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

Abstract

새로운 object detection network인 YOLO 제안 (one-stage detector)

기존의 방식(classifiers로 detection 수행)과 다르게, 하나의 regression problem으로 처리 (공간적으로 구분된 bounding boxes와 그에 해당하는 class probabilities)

single neural network로 전체 image로부터 bounding boxes와 class probabilities 예측

end-to-end 방식

통합된 구조의 YOLO는 상당히 빠른 속도가 특징

초당 45 frames 처리, real-time 가능

Fast YOLO : 초당 155 frames 처리할 정도로 매우 빠르면서 다른 real-time 방식들에 비해 mAP는 2배 이상 높음
기존 SOTA 방식들과 비교하였을 때, localization error는 더 높지만, 배경에 대한 false positives는 적음
YOLO는 일반적인 objects의 representations을 학습

자연 이미지에서 아트워크와 같은 다른 영역으로 일반화할 때, DPM과 R-CNN을 포함한 다른 방식들을 능가

memoryblock

Introduction

현재 detection systems은 detection을 classifiers로 수행함 (deformable parts models, DPM)

하나의 object를 detection하기 위해 하나의 classifier가 필요

전체 image에 대하여 sliding window 방식으로 진행됨

R-CNN 방식은 region proposal 방식 사용

먼저, image에서 bounding boxes 후보를 생성하고(selective search), 그런 boxes에 대하여 하나의 classifier 수행

classification을 수행하고 나면, bounding boxes를 refinement 시키기 위해 post-processing을 수행하고, 비슷한 detection proposal을 제거

=> R-CNN 한계 : 복잡한 pipelines으로 속도가 느리고, optimization이 힘듦 (two-stage로 각 component가 분리되어 학습을 진행하기 때문)

memoryblock

YOLO : image pixels에서 bounding box coordinates와 class probabilities를 직접 예측

image에 어떤 object가 있는지와 그 위치를 예측하기 위해 오직 한번만 보면 됨 (You Only Look Once)

memoryblock

YOLO 동작 방식 : single convolutional network를 이용하여 다양한 bounding box와 그에 대한 class probability를 동시에 예측
YOLO의 장점

1) 복잡하지 않은 pipeline으로 상당히 빠른 속도

2) 예측을 할 때, image에 대해 global하게 판단함

sliding window 와 region proposal 방식들과 다르게, YOLO는 training과 testing time에 전체 image를 보면서 classes에 대한 contextual 정보를 얻을 수 있음

YOLO는 Fast-RCNN 방식과 비교했을 때, background errors의 수가 절반도 되지 않음

3) 일반적인 objects의 representations을 학습함

새로운 domian이나 예상치 못한 input이 들어왔을 때, 더 잘 다룰 수 있음

YOLO는 SOTA와 비교하였을 때 정확도는 떨어지지만, 속도가 빠르고 작은 물체를 잘 탐지할 수 있음

Unified Detection

YOLO는 end-to-end training, 높은 AP를 유지하면서 real-time speed가 가능함

input image는 S x S grid로 나눠지고, 각 grid cell은 B개의 bounding box와 그에 대한 confidence scores가 예측됨

confidence는 object가 없을 경우는 0, 있을 경우는 IoU와 동일하길 바람

Bounding box : x,y,w,h,confidence / (x,y) : box의 center 좌표

모든 grid cell은 C(conditional class probabilities)를 예측

grid cell이 class i에 해당하는 object를 포함하고 있을 확률 Pr(Class_i Object)

class-specific confidence scores for each box

memoryblock

evaluation을 위해 PASCAL VOC dataset을 사용하였고, S=7 B=2로 setting (C=20) : 7x7x30 tensor

Network Design

memoryblock

초기 layer : image로부터 features 추출

fully connected layers : probabilities와 bounding box coordinates 예측

24개의convolutional layers와 2개의 fully connected layers로 구성 (GoogLeNet에서 영감)

GoogLeNet의 inception module 대신, 3x3 convolutional layers 이전 layer로 단순한 1x1 reduction layers 사용

Fast YOLO 구조 : 9개의 convolutional layers로 구성되어 있으며, layer는 더 적은 filters를 가짐

YOLO network 사이즈는 다르지만, 동일한 parameters로 이루어짐 (7x7x30 tensor)

2.2 Training

Pretraining을 위해, 20개의 convolutional layers(하나의 average-pooling layer와 하나의 fully connected layer가 따르는) 사용

ImageNet 2012 validation set으로 pretrain (input:224x224)

Pretrained model에 4개의 convolutional layer와 2개의 fully connected layer를 추가하여 성능을 향상시킴 (randomly initialized weights) [29]

detection은 fine-grained visual information이 필요하기 때문에 input resolution을 448x448로 늘림

YOLO의 최종 layer는 class probabilities와 bounding box coordinates 예측

최종 layer를 위해 linear activation function 사용 (다른 모든 layer에는 leaky rectified linear activation 사용)

Sum-squared error를 사용하여 optimization

1) lacalization error와 classification error를 동일하게 측정하면서, object를 포함하지 않는 다량의 box들에 의해 학습이 불안정해짐

bounding box coordinate prediction loss를 키우고, confidence prediction loss를 줄이기 위해 balancing term 적용

2) Sum-squared error는 large boxes와 small boxes의 error를 동일하게 측정

부분적으로 해결하기 위해, 너비와 높이 직접 예측하는 대신 bounding box 너비와 높이의 제곱근을 예측 ??

YOLO는 학습 과정에서 grid cell 별로 다양한 bounding boxes를 예측

object를 예측하기 위해 하나의 predictor를 지정 (가장 높은 IoU 값을 갖는 prediction)

각 predictor는 특정 사이즈, 가로 세로 비율, object의 class를 더 잘 예측하면서 전반적인 recall을 개선시킴

Loss function

memoryblock

Share on

Twitter Facebook LinkedIn

Hyeyoon Kang

[Paper Review] You only look once: Unified, real-time object detection

You only look once: Unified, real-time object detection

Abstract

Introduction

Unified Detection

2.2 Training

Share on

Leave a comment

You may also enjoy

[Paper Review] Language models are few-shot learners

[Paper Review] Language models are unsupervised multitask learners

[Paper Review] Exploring the limits of transfer learning with a unified text-to-text transformer

[Paper Review] Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension