[Paper Review] Efficient and accurate arbitrary-shaped text detection with pixel aggregation network

3 minute read

Efficient and accurate arbitrary-shaped text detection with pixel aggregation network

Wang, Wenhai, et al. “Efficient and accurate arbitrary-shaped text detection with pixel aggregation network.” Proceedings of the IEEE International Conference on Computer Vision. 2019.

Abstract

Text reading systems에서 중요한 단계인 scene text detection은 convolutional neural networks와 함께 빠르게 발전함
하지만 이를 real-world applications에 적용하기 어려운 2가지 challenges가 존재함

1) 속도와 정확도 사이의 trade-off 관계

2) 임의의 shape를 갖는 text를 다루기 힘듦 (arbitrary-shaped text instance)

최근, 임의의 shape를 다룰 수 있는 방식들이 제안되었으나 속도 문제로 실용적이지 못함

낮은 계산 비용의 segmentation head와 learnable post-processing으로 이루어진 Pixel Aggregation Network(PAN) 제안

efficient and accurate arbitrary-shaped text detector

Segmentation head : Feature Pyramid Enhancement Module(FPEM), Feature Fusion Module(FFM)

1) FPEM : cascadable U-shaped module로, 더 좋은 segmentation을 위해 multi-level 정보를 이용

2) FFM : FPEM에서 받은 features(different depths??)를 segmentation을 위한 최종 feature로 융합시킴

Pixel Aggregation(PA)로 구현된 learnable post-processing 수행

예측된 similarity vectors로 text pixels을 정확하게 융합시킬 수 있음

standard benchmarks에서 실험을 하여 PAN의 superiority를 증명

CTW1500에서 84.2 FPS로 79.9%의 F-measure를 달성 (빠르고 정확한 PAN)

memoryblock

Introduction

Scene text detection은 computer vision task(text-related)에서 기초적이고 중요한 task임

text recognition, text retrieval, license plate recognition, text visual quenstion answering 등등

CNN을 통한 object detection과 segmentation의 발전으로 scene text detection이 큰 발전을 이룸
주된 challenges 중 하나인 Arbitrary-shaped text detection에 많은 연구가 집중됨 (curved text instance)

이러한 방식들은 heavy models 또는 복잡한 후처리 과정으로 인해 속도가 느리다는 문제점이 있음

한편, 높은 효율성을 갖는 이전의 text detectors는 대부분 quadrangular text instances를 예측하기 때문에 curved text를 예측하기에 어려움이 있음

How to design an efficient and accurate arbitrary-shaped text detector?

memoryblock

Arbitrary-shaped text detector인 Pixel Aggregation Network(PAN) 제안

속도와 성능의 balance가 적절함

PAN은 단순한 pipeline으로 arbitrary-shaped text detection을 만듦

1) segmentation network로 text regions, kernels, similarity vectors를 예측

2) 예측된 kernels을 통해 완전한 text instances를 rebuilding

segmentation에 lightweight backbone을 사용 (ResNet 18)

lightweight backbone의 문제점 : 약한 feature 추출, 작은 receptive fields, 약한 representation capabilities

위와 같은 문제점을 해결하기 위해, 낮은 계산 비용을 갖는 segmentation head를 제안

Feature Pyramid Enhancement Module(FPEM), Feature Fusion Module(FFM)로 구성

Feature Pyramid Enhancement Module(FPEM)

memoryblock

separable convolutions로 구성된 U-shaped module

최소 비용으로 low-level과 high-level 정보를 융합시켜 다른 scales의 features를 향상시킴

lightweight backbone의 depth를 보완하기 위해 cascade 구조로 FPEMs를 사용

Feature Fusion Module(FFM)

memoryblock

최종 segmentation 전에 low-level과 high-level 정보를 합치기 위해 사용하는 모듈

다른 depths의 FPEMs에서 생성된 features를 융합

Learnable post-processing 방식을 제안 : Pixel Aggregation(PA)

완전한 text instances를 정확하게 재구성하기 위해, 예측된 similarity vectors를 통해 text pixels을 수정

challenging benchmark datasets(CTW1500, Total-Text, ICDAR2015, MSRA-TD500)에서 실험을 진행

STW1500, Total-Text는 curve text detection을 위한 새로운 datasets

PAN은 multi-oriented text와 long text에서도 좋은 결과를 보임

Contribution

FPEM과 FFM으로 구성된 lightweight segmentation neck을 제안

network의 feature representation 능력을 향상

Pixel Aggregation(PA)를 제안

text similarity vector가 network를 통해 학습되고, 이를 통해 선택적으로 text kernel 주변 pixels을 aggregation

제안된 method는 2개의 curved text benchmarks에서 SOTA 달성하였고, inference 속도는 58 FPS 유지(real-time)

Proposed Method

memoryblock

3.1 Overall Architecture

segmentation head는 2개의 module로 구성

1) Feature Pyramid Enhancement Module(FPEM)

cascadable, 낮은 계산 비용

backbone 뒤에 붙여서 사용하며, 깊고 expressive한 다른 scales의 feature를 생성

2) Feature Fusion Module(FFM)

다른 depths의 FPEMs에 의해 생성된 features를 segmentation을 위한 최종적인 하나의 feature로 융합시킴

PAN은 text regions, kernels, similarity vector 총 3가지를 예측

lightweight model로 ResNet-18을 사용하고 4개의 feature maps을 생성 (stride:4,8,16,32)
thin feature pyramid를 얻기 위해 1x1 convolution을 사용하여 channel을 감소시킴

동작방식

1) 각 FPEM으로 향상된 feature pyramid를 생성 (nc enhanced feature pyramids)

2) FFM이 feature pyramids를 하나의 feature map F로 융합 (stride:4, channel number:512)

3) feature map F로 text regions, kernels, similarity vectors를 예측

4) 최종적으로, text instances를 얻기 위해 단순하고 효율적인 post-processing 알고리즘을 적용

3.2 Feature Pyramid Enhancement Module

memoryblock

U-shaped module : up-scale enhancement, down-scale enhancement
separable convolution 사용 : 3x3 depthwise convolution, lxl projection

3x3 depthwise convolution를 통해 receptive field를 확장시킬 수 있고, 1x1 convolution을 통해 적은 계산 비용으로 깊은 network를 만듦

FPN와 유사한 점 : low-level과 high-level 정보를 융합시키면서 다른 scales의 features를 향상시킴
FPN과 다른 2가지

1) cascade module : 다른 scales의 feature maps을 적절하게 융합시키고, features의 receptive fields를 더 크게 만듦

2) 적은 계산 비용 (separable convolution)

3.3 Feature Fusion Module

memoryblock

다른 depths의 feature pyramids를 융합시키기 위한 모듈

1) 동일한 scale의 feature maps을 element-wise addition을 통해 결합

2) feature maps을 upsample 시키고, concatenation 수행하여 final feature map을 얻음 (4x128 channels)

3.4 Pixel Aggregation(Post-processing)

완전한 text instance를 재구성하기 위해서는 text regions의 pixels을 kernels에 병합하여야 함
Pixel Aggregation 제안 (learnable algorithm)

text pixels을 올바른 kernels로 guide

kernels로부터 완전한 text instances를 재구성하기 위해 clustering 기법을 사용

text instances는 clusters, kernels은 cluster centers로 간주

같은 text instance의 text pixel과 kernel 사이 차이를 줄이는 것을 목표로 loss를 학습시킴

Aggregation loss

동일한 text instance의 kernel과 text pixel 사이 차이를 줄이기 위한 loss

memoryblock

Discrimination loss

다른 text instance의 kernel을 구별하기 위한 loss

memoryblock

3.5 Loss function

memoryblock

Share on

Twitter Facebook LinkedIn

Hyeyoon Kang

[Paper Review] Efficient and accurate arbitrary-shaped text detection with pixel aggregation network

Efficient and accurate arbitrary-shaped text detection with pixel aggregation network

Abstract

Introduction

How to design an efficient and accurate arbitrary-shaped text detector?

Contribution

Proposed Method

3.1 Overall Architecture

3.2 Feature Pyramid Enhancement Module

3.3 Feature Fusion Module

3.4 Pixel Aggregation(Post-processing)

3.5 Loss function

Share on

Leave a comment

You may also enjoy

[Paper Review] Language models are few-shot learners

[Paper Review] Language models are unsupervised multitask learners

[Paper Review] Exploring the limits of transfer learning with a unified text-to-text transformer

[Paper Review] Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension