[Paper Review] PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit

3 minute read

PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit

Mou, Yongqiang, et al. “PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit.”

Abstract

기존의 문제점 : high blur나 low-resolution을 갖는 images에 의해 recognition 성능이 감소

low-quality scene text를 인식하기 위한 네트워크를 제안

PlugNet : pluggable super-resolution unit(PSU)을 갖는 scene text recognizer

PSU와 함께 end-to-end로 학습이 가능

low-resolution으로 인한 문제를 feature level에서 해결

PSU는 inference time에 제거할 수 있어서 추가적인 계산비용이 들지 않음

Pluggable super-resolution unit : low-quality text images를 잘 인식하기 위한 더 강력한 feature representation 학습
Feature enhancement를 위한 2가지 전략

1) Feature Squeeze Module : spatial acuity의 손실을 감소

2) Feature Enhancement Module : 다양한 semantics을 얻기 위해 low-level features와 high-level features를 결합

Text recognition benchmarks(IIIT5K,SVT,SVTP,ICDAR15,etc)에서 SOTA 성능 달성

Introduction

SOTA scene text recognizers는 2가지 categories로 나뉨

1) The bottom-up approaches : 글자 단위에서 text를 인식

2) The top-down approaches : 전체 이미지에서 text 인식

문제점 : nosing, blurred or low-resolution으로 인한 low-quality images가 잘못된 결과를 초래

기존 다른 컴퓨터 비전 tasks에서 low-quality images를 해결할 때, image-level에서 문제를 다룸

SR network + text recognition network는 계산 효율성이 좋지 않음 (시간이 너무 오래 걸림, fig.1.b 첫번째 그림)

memoryblock

기존의 방식들과 다르게 feature-level에서 degradation images를 해결함 (fig.1 b 두번째 그림)
pluggable super-resolution unit과 함께 end-to-end 학습이 가능한 scene text recognizer 제안 (PlugNet)

4-parts : rectification network - CNN backbone - recognition network - pluggable super-resolution unit(PSU)

training stage에서 feature quality를 향상시키기 위해 upsampling layers와 적은 convolution layers로 구성된 light-weight pluggable super-resolution unit을 사용

inference stage에서 PSU를 제거하기 때문에 추가적인 계산 비용은 들지 않음

많은 text recognition framework들이 CNN-LSTM을 사용하여 높은 성능을 보임

한계 : CNN은 rotation, shift와 같은 spatial-level issues에서 제한된 성능을 보임

spatial acuity(예리함)의 손실로 인해 recognition part와 rectified part 모두 효과적인 학습이 어려움

최종 one-dimension vectors에서 더 많은 spatial 정보를 유지하기 위해 Feature Squeeze Module을 제안
Feature Squeeze Module

feature resolution을 유지하기 위해 마지막 3개의 blocks에서 down-sampling convolution layers를 제거

feature maps에서 one-dimension vectors를 생성하기 위해, 하나의 1x1 convolution layer와 하나의 reshape layer 사용

모든 datasets에서 상당한 성능향상을 보임

Featrue Pyramid Networks에서 영감을 얻어, Feature Enhancement Module(FEM) 제안

low-level에서 high-level로 semantics 정보 결합

The main contributions

1) end-to-end trainable scene text recognizer(PlugNet)

2) feature squeeze module(FSM)

CNN-based backbone과 LSTM-based recognition model을 연결시키는 방식을 제공하고, 이는 top-down text recognition 방식을 위한 baseline으로 사용될 수 있음

3) feature enhance module(FEM)

low-level features와 high-level features를 결합시켜 sharing feature maps을 강화시킴

4) the state-of-the-art performance

Approach

3.1 Overall Framework

memoryblock

Rectification Network (fig2.a)

irregular scene text를 rectication 시키기 위한 네트워크

Aster(irregular scene text recognition에서 높은 성능)와 동일한 방식을 사용함

3-parts : localization network - grid generator - sampler

1) localization network : CNN-based network로 input img에서 n개의 control points로 text의 경계선(borders)을 localize

2) grid generator : localization 결과를 활용하고, Thin-Plate-Spline(TPS)를 통해 각 pixel에 대한 transformation matrix를 계산

3) sampler : rectifed images를 생성

Sharing CNN Backbone (fig2.b)

memoryblock

feature를 추출하기 위해 ResNet-based 구조를 사용

Aster와 비슷한 구조이지만 더 많은 spatial 정보를 유지하기 위해, 마지막 3개의 CNN blocks에서 down-sampling layers를 제거

Recognition Part (fig2.d,e)

ESIR, Aster를 따라서 text recognition을 위해 LSTM-based 방식을 사용 (전체 sequences를 학습)

1) Feature Squeeze Module

sharing CNN backbone에서 나온 features가 입력으로 들어가고, one-dimension vectors 생성

2) Recognition Head : squence-to-sequence

two-layer Bidirectional LSTM(BiLSTM) : 양방향 long-range dependencies를 포착하여 강력한 새로운 sequence H 생성 (input과 동일한 길이), encoder

two-layer attentional LSTM : sequence H를 output sequence Y로 변형시킴, decoder

Pluggable ST Unit

FSM에 의해, sharing CNN backbone이 image resolution을 유지할 수 있기 때문에 네트워크에 PSU를 붙이기 쉬움

high-level features에서 super-resolution images를 만듦

3.2 Pluggable Super-resolution Unit

feature-level에서 degradation images를 해결하기 위해 디자인됨
PSU는 sharing CNN backbone이 degradation images의 features를 더 잘 representation하도록 도움을 줌
RCAN 구조를 이용하여 PSU를 만듦
하나의 Residual Group(RG)를 생성하기 위해 2개의 two Residual Channel Attention Block(RCAB)를 사용

2개의 RG는 최종 PSU를 만드는데 사용됨

inference stage에서 PSU는 제거되기 때문에 추가적인 계산 비용이 들지 않는 것이 장점

3.3 Feature Enhancement

Feature Squeeze Module (fig2.d)

one-dimension vectors에서 더 많은 resolution 정보를 유지하기 위해 down-sampling convolution layers를 FSM으로 대체

FSM은 channels을 감소기키기 위한 하나의 1x1 convolution layer와 feature maps에서 one-dimension vectors를 생성하기 위한 하나의 reshape layer로 구성(적은 계산 비용)

Feature Enhance Module (fig2.b)

low-level에서 high-level로 features를 결합시킴

down-sampling layer를 통해 shape을 변형시키고, low-level에서 high-level로 모든 feature maps을 concatenation 수행하여 향상된 feature를 얻음

Share on

Twitter Facebook LinkedIn

Hyeyoon Kang

[Paper Review] PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit

PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit

Abstract

Introduction

Approach

3.1 Overall Framework

3.2 Pluggable Super-resolution Unit

3.3 Feature Enhancement

Share on

Leave a comment

You may also enjoy

[Paper Review] Language models are few-shot learners

[Paper Review] Language models are unsupervised multitask learners

[Paper Review] Exploring the limits of transfer learning with a unified text-to-text transformer

[Paper Review] Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension