Merge pull request #17570 from HannibalAPE:text_det_recog_demo
[GSoC] High Level API and Samples for Scene Text Detection and Recognition

* APIs and samples for scene text detection and recognition
* update APIs and tutorial for Text Detection and Recognition
* API updates: (1) put decodeType into struct Voc (2) optimize the post-processing of DB
* sample update: (1) add transformation into scene_text_spotting.cpp (2) modify text_detection.cpp with API update
* update tutorial
* simplify text recognition API, update tutorial
* update impl usage in recognize() and detect()
* dnn: refactoring public API of TextRecognitionModel/TextDetectionModel
* update provided models, update opencv.bib
* dnn: adjust text rectangle angle
* remove points ordering operation in model.cpp
* update gts of DB test in test_model.cpp
* dnn: ensure to keep text rectangle angle - avoid 90/180 degree turns
* dnn(text): use quadrangle result in TextDetectionModel API
* dnn: update Text Detection API (1) keep points' order consistent with (bl, tl, tr, br) in unclip (2) update contourScore with boundingRect

pull/19012/head
parent 5ecf693774
commit 22d64ae08f
19 changed files with 2340 additions and 182 deletions
@@ -0,0 +1,316 @@ doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown
# High Level API: TextDetectionModel and TextRecognitionModel {#tutorial_dnn_text_spotting}

@prev_tutorial{tutorial_dnn_OCR}

## Introduction

In this tutorial, we will introduce the APIs for TextRecognitionModel and TextDetectionModel in detail.

---
#### TextRecognitionModel:

In the current version, @ref cv::dnn::TextRecognitionModel only supports CNN+RNN+CTC based algorithms,
and the greedy decoding method for CTC is provided.
For more information, please refer to the [original paper](https://arxiv.org/abs/1507.05717).

Before recognition, you should call `setVocabulary` and `setDecodeType`.
- "CTC-greedy": the output of the text recognition model should be a probability matrix.
  The shape should be `(T, B, Dim)`, where
  - `T` is the sequence length,
  - `B` is the batch size (only `B=1` is supported for inference),
  - and `Dim` is the vocabulary size + 1 (the 'Blank' token of CTC is at index 0 of `Dim`).
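To make the decoding rule concrete, here is a hedged sketch of CTC greedy decoding over such a probability matrix: take the most probable class at each time step, collapse consecutive repeats, and drop the blank (index 0). This only illustrates the rule; OpenCV performs the decoding internally, and the `probs` matrix and `vocabulary` below are placeholders.

```cpp
#include <opencv2/core.hpp>
#include <string>
#include <vector>

// Sketch of CTC greedy decoding for a (T, B=1, Dim) output,
// viewed here as a T x Dim matrix "probs". Not the internal OpenCV code.
std::string ctcGreedyDecode(const cv::Mat& probs, const std::vector<std::string>& vocabulary)
{
    std::string result;
    int prevClass = 0; // index 0 is the CTC 'Blank'
    for (int t = 0; t < probs.rows; t++)
    {
        cv::Point maxLoc;
        cv::minMaxLoc(probs.row(t), nullptr, nullptr, nullptr, &maxLoc);
        int cls = maxLoc.x; // argmax over Dim at time step t
        if (cls != 0 && cls != prevClass)  // skip blanks and collapsed repeats
            result += vocabulary[cls - 1]; // vocabulary entries start at index 1 of Dim
        prevClass = cls;
    }
    return result;
}
```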
@ref cv::dnn::TextRecognitionModel::recognize() is the main function for text recognition.
- The input image should be a cropped text image or an image with `roiRects`.
- Other decoding methods may be supported in the future.

---

#### TextDetectionModel:

@ref cv::dnn::TextDetectionModel API provides these methods for text detection (see the sketch below):
- cv::dnn::TextDetectionModel::detect() returns the results as std::vector<std::vector<Point>> (four-point quadrangles)
- cv::dnn::TextDetectionModel::detectTextRectangles() returns the results as std::vector<cv::RotatedRect> (RBOX-like)
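A minimal sketch of the RotatedRect variant, assuming `model` and `frame` are already set up as in the full examples below:

```cpp
// Sketch: detect rotated text boxes and draw them.
// "model" is a configured TextDetectionModel_DB or TextDetectionModel_EAST,
// "frame" is the input image (see the full examples below).
std::vector<cv::RotatedRect> rects;
model.detectTextRectangles(frame, rects);
for (const cv::RotatedRect& rect : rects)
{
    cv::Point2f vertices[4];
    rect.points(vertices); // corners of the rotated box
    for (int j = 0; j < 4; j++)
        cv::line(frame, vertices[j], vertices[(j + 1) % 4], cv::Scalar(0, 255, 0), 2);
}
```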
In the current version, @ref cv::dnn::TextDetectionModel supports these algorithms:
- use @ref cv::dnn::TextDetectionModel_DB with "DB" models,
- and use @ref cv::dnn::TextDetectionModel_EAST with "EAST" models.

The pretrained models provided below are variants of DB (without deformable convolution),
and their performance is reported in Table 1 of the [paper](https://arxiv.org/abs/1911.08947).
For more information, please refer to the [official code](https://github.com/MhLiao/DB).

---

You can train your own model with more data, and convert it into ONNX format.
We encourage you to add new algorithms to these APIs.
## Pretrained Models

#### TextRecognitionModel:

```
crnn.onnx:
url: https://drive.google.com/uc?export=dowload&id=1ooaLR-rkTl8jdpGy1DoQs0-X0lQsB6Fj
sha: 270d92c9ccb670ada2459a25977e8deeaf8380d3
alphabet_36.txt: https://drive.google.com/uc?export=dowload&id=1oPOYx5rQRp8L6XQciUwmwhMCfX0KyO4b
parameter setting: -rgb=0;
description: The classification number of this model is 36 (0~9 + a~z).
             The training dataset is MJSynth.

crnn_cs.onnx:
url: https://drive.google.com/uc?export=dowload&id=12diBsVJrS9ZEl6BNUiRp9s0xPALBS7kt
sha: a641e9c57a5147546f7a2dbea4fd322b47197cd5
alphabet_94.txt: https://drive.google.com/uc?export=dowload&id=1oKXxXKusquimp7XY1mFvj9nwLzldVgBR
parameter setting: -rgb=1;
description: The classification number of this model is 94 (0~9 + a~z + A~Z + punctuation).
             The training datasets are MJSynth and SynthText.

crnn_cs_CN.onnx:
url: https://drive.google.com/uc?export=dowload&id=1is4eYEUKH7HR7Gl37Sw4WPXx6Ir8oQEG
sha: 3940942b85761c7f240494cf662dcbf05dc00d14
alphabet_3944.txt: https://drive.google.com/uc?export=dowload&id=18IZUUdNzJ44heWTndDO6NNfIpJMmN-ul
parameter setting: -rgb=1;
description: The classification number of this model is 3944 (0~9 + a~z + A~Z + Chinese characters + special characters).
             The training dataset is ReCTS (https://rrc.cvc.uab.es/?ch=12).
```

More models can be found [here](https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing),
which are taken from [clovaai](https://github.com/clovaai/deep-text-recognition-benchmark).
You can train more models with [CRNN](https://github.com/meijieru/crnn.pytorch), and convert them with `torch.onnx.export`.
#### TextDetectionModel:

```
- DB_IC15_resnet50.onnx:
url: https://drive.google.com/uc?export=dowload&id=17_ABp79PlFt9yPCxSaarVc_DKTmrSGGf
sha: bef233c28947ef6ec8c663d20a2b326302421fa3
recommended parameter setting: -inputHeight=736, -inputWidth=1280;
description: This model is trained on ICDAR2015, so it can only detect English text instances.

- DB_IC15_resnet18.onnx:
url: https://drive.google.com/uc?export=dowload&id=1sZszH3pEt8hliyBlTmB-iulxHP1dCQWV
sha: 19543ce09b2efd35f49705c235cc46d0e22df30b
recommended parameter setting: -inputHeight=736, -inputWidth=1280;
description: This model is trained on ICDAR2015, so it can only detect English text instances.

- DB_TD500_resnet50.onnx:
url: https://drive.google.com/uc?export=dowload&id=19YWhArrNccaoSza0CfkXlA8im4-lAGsR
sha: 1b4dd21a6baa5e3523156776970895bd3db6960a
recommended parameter setting: -inputHeight=736, -inputWidth=736;
description: This model is trained on MSRA-TD500, so it can detect both English and Chinese text instances.

- DB_TD500_resnet18.onnx:
url: https://drive.google.com/uc?export=dowload&id=1vY_KsDZZZb_svd5RT6pjyI8BS1nPbBSX
sha: 8a3700bdc13e00336a815fc7afff5dcc1ce08546
recommended parameter setting: -inputHeight=736, -inputWidth=736;
description: This model is trained on MSRA-TD500, so it can detect both English and Chinese text instances.
```

We will release more models of DB [here](https://drive.google.com/drive/folders/1qzNCHfUJOS0NEUOIKn69eCtxdlNPpWbq?usp=sharing) in the future.

```
- EAST:
Download link: https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1
This model is based on https://github.com/argman/EAST
```

## Images for Testing

```
Text Recognition:
url: https://drive.google.com/uc?export=dowload&id=1nMcEy68zDNpIlqAn6xCk_kYcUTIeSOtN
sha: 89205612ce8dd2251effa16609342b69bff67ca3

Text Detection:
url: https://drive.google.com/uc?export=dowload&id=149tAhIcvfCYeyufRoZ9tmc2mZDKE_XrF
sha: ced3c03fb7f8d9608169a913acf7e7b93e07109b
```
## Example for Text Recognition

Step1. Loading images and models with a vocabulary

```cpp
// Load a cropped text line image
// you can find cropped images for testing in "Images for Testing"
int rgb = IMREAD_COLOR; // This should be changed according to the model input requirement.
Mat image = imread("path/to/text_rec_test.png", rgb);

// Load model weights
TextRecognitionModel model("path/to/crnn_cs.onnx");

// Set the decoding method
// more methods will be supported in the future
model.setDecodeType("CTC-greedy");

// Load vocabulary
// vocabulary should be changed according to the text recognition model
std::ifstream vocFile;
vocFile.open("path/to/alphabet_94.txt");
CV_Assert(vocFile.is_open());
String vocLine;
std::vector<String> vocabulary;
while (std::getline(vocFile, vocLine)) {
    vocabulary.push_back(vocLine);
}
model.setVocabulary(vocabulary);
```

Step2. Setting Parameters

```cpp
// Normalization parameters
double scale = 1.0 / 127.5;
Scalar mean = Scalar(127.5, 127.5, 127.5);

// The input shape
Size inputSize = Size(100, 32);

model.setInputParams(scale, inputSize, mean);
```
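`setInputParams` configures the same preprocessing that `blobFromImage` applies: each pixel is mapped roughly as `(x - mean) * scale`, so with `scale = 1/127.5` and `mean = 127.5` the 8-bit input lands in `[-1, 1]`. A hedged sketch of the equivalent manual call:

```cpp
// Equivalent manual preprocessing (sketch): resize to inputSize,
// subtract mean, then multiply by scale.
Mat blob = blobFromImage(image, scale, inputSize, mean, /*swapRB=*/false, /*crop=*/false);
```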
Step3. Inference

```cpp
std::string recognitionResult = model.recognize(image);
std::cout << "'" << recognitionResult << "'" << std::endl;
```

Input image:

![Picture example](text_rec_test.png)

Output:
```
'welcome'
```
## Example for Text Detection

Step1. Loading images and models
```cpp
// Load an image
// you can find some images for testing in "Images for Testing"
Mat frame = imread("/path/to/text_det_test.png");
```

Step2.a Setting Parameters (DB)
```cpp
// Load model weights
TextDetectionModel_DB model("/path/to/DB_TD500_resnet50.onnx");

// Post-processing parameters
float binThresh = 0.3;
float polyThresh = 0.5;
uint maxCandidates = 200;
double unclipRatio = 2.0;
model.setBinaryThreshold(binThresh)
     .setPolygonThreshold(polyThresh)
     .setMaxCandidates(maxCandidates)
     .setUnclipRatio(unclipRatio);

// Normalization parameters
double scale = 1.0 / 255.0;
Scalar mean = Scalar(122.67891434, 116.66876762, 104.00698793);

// The input shape
Size inputSize = Size(736, 736);

model.setInputParams(scale, inputSize, mean);
```

Step2.b Setting Parameters (EAST)
```cpp
TextDetectionModel_EAST model("EAST.pb");

float confThreshold = 0.5;
float nmsThreshold = 0.4;
model.setConfidenceThreshold(confThreshold)
     .setNMSThreshold(nmsThreshold);

double detScale = 1.0;
Size detInputSize = Size(320, 320);
Scalar detMean = Scalar(123.68, 116.78, 103.94);
bool swapRB = true;
model.setInputParams(detScale, detInputSize, detMean, swapRB);
```

Step3. Inference
```cpp
std::vector<std::vector<Point>> detResults;
model.detect(frame, detResults);

// Visualization
polylines(frame, detResults, true, Scalar(0, 255, 0), 2);
imshow("Text Detection", frame);
waitKey();
```

Output:

![Picture example](text_det_test_results.jpg)
## Example for Text Spotting

After following the steps above, it is easy to get the detection results of an input image.
Then, you can do the transformation and crop the text images for recognition.
For more information, please refer to the **Detailed Sample** section.
```cpp
// Transform and Crop
Mat cropped;
fourPointsTransform(recInput, vertices, cropped);

String recResult = recognizer.recognize(cropped);
```
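The `fourPointsTransform` helper used above is defined in the spotting sample; a minimal sketch (matching the definition in samples/dnn/scene_text_spotting.cpp) warps the detected quadrangle, ordered (bl, tl, tr, br), onto the fixed 100x32 recognition input:

```cpp
// Warp a detected quadrangle (bl, tl, tr, br) to the 100x32 recognition input.
void fourPointsTransform(const Mat& frame, const Point2f vertices[], Mat& result)
{
    const Size outputSize = Size(100, 32);
    Point2f targetVertices[4] = {
        Point(0, outputSize.height - 1),                    // bottom-left
        Point(0, 0),                                        // top-left
        Point(outputSize.width - 1, 0),                     // top-right
        Point(outputSize.width - 1, outputSize.height - 1)  // bottom-right
    };
    Mat rotationMatrix = getPerspectiveTransform(vertices, targetVertices);
    warpPerspective(frame, result, rotationMatrix, outputSize);
}
```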
Output Examples:

![Picture example](detect_test1.jpg)

![Picture example](detect_test2.jpg)

## Source Code
The [source code](https://github.com/opencv/opencv/blob/master/modules/dnn/src/model.cpp)
of these APIs can be found in the DNN module.

## Detailed Sample
For more information, please refer to:
- [samples/dnn/scene_text_recognition.cpp](https://github.com/opencv/opencv/blob/master/samples/dnn/scene_text_recognition.cpp)
- [samples/dnn/scene_text_detection.cpp](https://github.com/opencv/opencv/blob/master/samples/dnn/scene_text_detection.cpp)
- [samples/dnn/text_detection.cpp](https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.cpp)
- [samples/dnn/scene_text_spotting.cpp](https://github.com/opencv/opencv/blob/master/samples/dnn/scene_text_spotting.cpp)

#### Test with an image
Examples:
```bash
example_dnn_scene_text_recognition -mp=path/to/crnn_cs.onnx -i=path/to/an/image -rgb=1 -vp=/path/to/alphabet_94.txt
example_dnn_scene_text_detection -mp=path/to/DB_TD500_resnet50.onnx -i=path/to/an/image -ih=736 -iw=736
example_dnn_scene_text_spotting -dmp=path/to/DB_IC15_resnet50.onnx -rmp=path/to/crnn_cs.onnx -i=path/to/an/image -iw=1280 -ih=736 -rgb=1 -vp=/path/to/alphabet_94.txt
example_dnn_text_detection -dmp=path/to/EAST.pb -rmp=path/to/crnn_cs.onnx -i=path/to/an/image -rgb=1 -vp=path/to/alphabet_94.txt
```

#### Test on public datasets
Text Recognition:

The download link for testing images can be found in the **Images for Testing** section.

Examples:
```bash
example_dnn_scene_text_recognition -mp=path/to/crnn.onnx -e=true -edp=path/to/evaluation_data_rec -vp=/path/to/alphabet_36.txt -rgb=0
example_dnn_scene_text_recognition -mp=path/to/crnn_cs.onnx -e=true -edp=path/to/evaluation_data_rec -vp=/path/to/alphabet_94.txt -rgb=1
```

Text Detection:

The download links for testing images can be found in the **Images for Testing** section.

Examples:
```bash
example_dnn_scene_text_detection -mp=path/to/DB_TD500_resnet50.onnx -e=true -edp=path/to/evaluation_data_det/TD500 -ih=736 -iw=736
example_dnn_scene_text_detection -mp=path/to/DB_IC15_resnet50.onnx -e=true -edp=path/to/evaluation_data_det/IC15 -ih=736 -iw=1280
```
@@ -0,0 +1,36 @@ alphabet_36.txt
0
1
2
3
4
5
6
7
8
9
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
@@ -0,0 +1,94 @@ alphabet_94.txt
0
1
2
3
4
5
6
7
8
9
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
:
;
<
=
>
?
@
[
\
]
^
_
`
{
|
}
~
@@ -0,0 +1,151 @@ samples/dnn/scene_text_detection.cpp
#include <iostream>
#include <fstream>
#include <regex>

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/dnn/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

std::string keys =
    "{ help h | | Print help message. }"
    "{ inputImage i | | Path to an input image. Skip this argument to capture frames from a camera. }"
    "{ modelPath mp | | Path to a binary .onnx file containing the trained DB detector model. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ inputHeight ih |736| Image height of the model input. It should be a multiple of 32.}"
    "{ inputWidth iw |736| Image width of the model input. It should be a multiple of 32.}"
    "{ binaryThreshold bt |0.3| Confidence threshold of the binary map. }"
    "{ polygonThreshold pt |0.5| Confidence threshold of polygons. }"
    "{ maxCandidate max |200| Max candidates of polygons. }"
    "{ unclipRatio ratio |2.0| Unclip ratio. }"
    "{ evaluate e |false| false: predict with input images; true: evaluate on benchmarks. }"
    "{ evalDataPath edp | | Path to benchmarks for evaluation. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}";

int main(int argc, char** argv)
{
    // Parse arguments
    CommandLineParser parser(argc, argv, keys);
    parser.about("Use this script to run the official PyTorch implementation (https://github.com/MhLiao/DB) of "
                 "Real-time Scene Text Detection with Differentiable Binarization (https://arxiv.org/abs/1911.08947)\n"
                 "The current version of this script is a variant of the original network without deformable convolution");
    if (argc == 1 || parser.has("help"))
    {
        parser.printMessage();
        return 0;
    }

    float binThresh = parser.get<float>("binaryThreshold");
    float polyThresh = parser.get<float>("polygonThreshold");
    uint maxCandidates = parser.get<uint>("maxCandidate");
    String modelPath = parser.get<String>("modelPath");
    double unclipRatio = parser.get<double>("unclipRatio");
    int height = parser.get<int>("inputHeight");
    int width = parser.get<int>("inputWidth");

    if (!parser.check())
    {
        parser.printErrors();
        return 1;
    }

    // Load the network
    CV_Assert(!modelPath.empty());
    TextDetectionModel_DB detector(modelPath);
    detector.setBinaryThreshold(binThresh)
            .setPolygonThreshold(polyThresh)
            .setUnclipRatio(unclipRatio)
            .setMaxCandidates(maxCandidates);

    double scale = 1.0 / 255.0;
    Size inputSize = Size(width, height);
    Scalar mean = Scalar(122.67891434, 116.66876762, 104.00698793);
    detector.setInputParams(scale, inputSize, mean);

    // Create a window
    static const std::string winName = "TextDetectionModel";

    if (parser.get<bool>("evaluate")) {
        // for evaluation
        String evalDataPath = parser.get<String>("evalDataPath");
        CV_Assert(!evalDataPath.empty());
        String testListPath = evalDataPath + "/test_list.txt";
        std::ifstream testList;
        testList.open(testListPath);
        CV_Assert(testList.is_open());

        // Create a window for showing the ground truth
        static const std::string winNameGT = "GT";

        String testImgPath;
        while (std::getline(testList, testImgPath)) {
            String imgPath = evalDataPath + "/test_images/" + testImgPath;
            std::cout << "Image Path: " << imgPath << std::endl;

            Mat frame = imread(samples::findFile(imgPath), IMREAD_COLOR);
            CV_Assert(!frame.empty());
            Mat src = frame.clone();

            // Inference
            std::vector<std::vector<Point>> results;
            detector.detect(frame, results);

            polylines(frame, results, true, Scalar(0, 255, 0), 2);
            imshow(winName, frame);

            // Load the ground truth: each line is "x1,y1,x2,y2,...,text"
            String imgName = testImgPath.substr(0, testImgPath.length() - 4);
            String gtPath = evalDataPath + "/test_gts/" + imgName + ".txt";
            // std::cout << gtPath << std::endl;
            std::ifstream gtFile;
            gtFile.open(gtPath);
            CV_Assert(gtFile.is_open());

            std::vector<std::vector<Point>> gts;
            String gtLine;
            while (std::getline(gtFile, gtLine)) {
                size_t splitLoc = gtLine.find_last_of(',');
                String text = gtLine.substr(splitLoc + 1);
                if (text == "###\r" || text == "1") {
                    // ignore difficult instances
                    continue;
                }
                gtLine = gtLine.substr(0, splitLoc);

                std::regex delimiter(",");
                std::vector<String> v(std::sregex_token_iterator(gtLine.begin(), gtLine.end(), delimiter, -1),
                                      std::sregex_token_iterator());
                std::vector<int> loc;
                std::vector<Point> pts;
                for (auto && s : v) {
                    loc.push_back(atoi(s.c_str()));
                }
                for (size_t i = 0; i < loc.size() / 2; i++) {
                    pts.push_back(Point(loc[2 * i], loc[2 * i + 1]));
                }
                gts.push_back(pts);
            }
            polylines(src, gts, true, Scalar(0, 255, 0), 2);
            imshow(winNameGT, src);

            waitKey();
        }
    } else {
        // Open an image file
        CV_Assert(parser.has("inputImage"));
        Mat frame = imread(samples::findFile(parser.get<String>("inputImage")));
        CV_Assert(!frame.empty());

        // Detect
        std::vector<std::vector<Point>> results;
        detector.detect(frame, results);

        polylines(frame, results, true, Scalar(0, 255, 0), 2);
        imshow(winName, frame);
        waitKey();
    }

    return 0;
}
@@ -0,0 +1,144 @@ samples/dnn/scene_text_recognition.cpp
#include <iostream>
#include <fstream>

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/dnn/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

String keys =
    "{ help h | | Print help message. }"
    "{ inputImage i | | Path to an input image. Skip this argument to capture frames from a camera. }"
    "{ modelPath mp | | Path to a binary .onnx file containing the trained CRNN text recognition model. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ RGBInput rgb |0| 0: imread with flags=IMREAD_GRAYSCALE; 1: imread with flags=IMREAD_COLOR. }"
    "{ evaluate e |false| false: predict with input images; true: evaluate on benchmarks. }"
    "{ evalDataPath edp | | Path to benchmarks for evaluation. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ vocabularyPath vp | alphabet_36.txt | Path to recognition vocabulary. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}";

String convertForEval(String &input);

int main(int argc, char** argv)
{
    // Parse arguments
    CommandLineParser parser(argc, argv, keys);
    parser.about("Use this script to run the PyTorch implementation of "
                 "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition "
                 "(https://arxiv.org/abs/1507.05717)");
    if (argc == 1 || parser.has("help"))
    {
        parser.printMessage();
        return 0;
    }

    String modelPath = parser.get<String>("modelPath");
    String vocPath = parser.get<String>("vocabularyPath");
    int imreadRGB = parser.get<int>("RGBInput");

    if (!parser.check())
    {
        parser.printErrors();
        return 1;
    }

    // Load the network
    CV_Assert(!modelPath.empty());
    TextRecognitionModel recognizer(modelPath);

    // Load vocabulary
    CV_Assert(!vocPath.empty());
    std::ifstream vocFile;
    vocFile.open(samples::findFile(vocPath));
    CV_Assert(vocFile.is_open());
    String vocLine;
    std::vector<String> vocabulary;
    while (std::getline(vocFile, vocLine)) {
        vocabulary.push_back(vocLine);
    }
    recognizer.setVocabulary(vocabulary);
    recognizer.setDecodeType("CTC-greedy");

    // Set parameters
    double scale = 1.0 / 127.5;
    Scalar mean = Scalar(127.5, 127.5, 127.5);
    Size inputSize = Size(100, 32);
    recognizer.setInputParams(scale, inputSize, mean);

    if (parser.get<bool>("evaluate"))
    {
        // For evaluation
        String evalDataPath = parser.get<String>("evalDataPath");
        CV_Assert(!evalDataPath.empty());
        String gtPath = evalDataPath + "/test_gts.txt";
        std::ifstream evalGts;
        evalGts.open(gtPath);
        CV_Assert(evalGts.is_open());

        String gtLine;
        int cntRight = 0, cntAll = 0;
        TickMeter timer;
        timer.reset();

        while (std::getline(evalGts, gtLine)) {
            size_t splitLoc = gtLine.find_first_of(' ');
            String imgPath = evalDataPath + '/' + gtLine.substr(0, splitLoc);
            String gt = gtLine.substr(splitLoc + 1);

            // Inference
            Mat frame = imread(samples::findFile(imgPath), imreadRGB);
            CV_Assert(!frame.empty());
            timer.start();
            std::string recognitionResult = recognizer.recognize(frame);
            timer.stop();

            if (gt == convertForEval(recognitionResult))
                cntRight++;

            cntAll++;
        }
        std::cout << "Accuracy: " << (double)(cntRight) / (double)(cntAll) << std::endl;
        std::cout << "Average Inference Time(ms): " << timer.getTimeMilli() / (double)(cntAll) << std::endl;
    }
    else
    {
        // Create a window
        static const std::string winName = "Input Cropped Image";

        // Open an image file
        CV_Assert(parser.has("inputImage"));
        Mat frame = imread(samples::findFile(parser.get<String>("inputImage")), imreadRGB);
        CV_Assert(!frame.empty());

        // Recognition
        std::string recognitionResult = recognizer.recognize(frame);

        imshow(winName, frame);
        std::cout << "Prediction: '" << recognitionResult << "'" << std::endl;
        waitKey();
    }

    return 0;
}

// Convert the predictions to lower case, and remove other characters.
// Only for evaluation
String convertForEval(String & input)
{
    String output;
    for (uint i = 0; i < input.length(); i++) {
        char ch = input[i];
        if ((int)ch >= 97 && (int)ch <= 122) {       // 'a'..'z': keep as-is
            output.push_back(ch);
        } else if ((int)ch >= 65 && (int)ch <= 90) { // 'A'..'Z': convert to lower case
            output.push_back((char)(ch + 32));
        } else {                                     // drop everything else
            continue;
        }
    }

    return output;
}
@@ -0,0 +1,169 @@ samples/dnn/scene_text_spotting.cpp
#include <iostream>
#include <fstream>

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/dnn/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

std::string keys =
    "{ help h | | Print help message. }"
    "{ inputImage i | | Path to an input image. Skip this argument to capture frames from a camera. }"
    "{ detModelPath dmp | | Path to a binary .onnx model for detection. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ recModelPath rmp | | Path to a binary .onnx model for recognition. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ inputHeight ih |736| Image height of the model input. It should be a multiple of 32.}"
    "{ inputWidth iw |736| Image width of the model input. It should be a multiple of 32.}"
    "{ RGBInput rgb |0| 0: imread with flags=IMREAD_GRAYSCALE; 1: imread with flags=IMREAD_COLOR. }"
    "{ binaryThreshold bt |0.3| Confidence threshold of the binary map. }"
    "{ polygonThreshold pt |0.5| Confidence threshold of polygons. }"
    "{ maxCandidate max |200| Max candidates of polygons. }"
    "{ unclipRatio ratio |2.0| Unclip ratio. }"
    "{ vocabularyPath vp | alphabet_36.txt | Path to recognition vocabulary. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}";

void fourPointsTransform(const Mat& frame, const Point2f vertices[], Mat& result);
bool sortPts(const Point& p1, const Point& p2);

int main(int argc, char** argv)
{
    // Parse arguments
    CommandLineParser parser(argc, argv, keys);
    parser.about("Use this script to run an end-to-end inference sample of textDetectionModel and textRecognitionModel APIs\n"
                 "Use -h for more information");
    if (argc == 1 || parser.has("help"))
    {
        parser.printMessage();
        return 0;
    }

    float binThresh = parser.get<float>("binaryThreshold");
    float polyThresh = parser.get<float>("polygonThreshold");
    uint maxCandidates = parser.get<uint>("maxCandidate");
    String detModelPath = parser.get<String>("detModelPath");
    String recModelPath = parser.get<String>("recModelPath");
    String vocPath = parser.get<String>("vocabularyPath");
    double unclipRatio = parser.get<double>("unclipRatio");
    int height = parser.get<int>("inputHeight");
    int width = parser.get<int>("inputWidth");
    int imreadRGB = parser.get<int>("RGBInput");

    if (!parser.check())
    {
        parser.printErrors();
        return 1;
    }

    // Load networks
    CV_Assert(!detModelPath.empty());
    TextDetectionModel_DB detector(detModelPath);
    detector.setBinaryThreshold(binThresh)
            .setPolygonThreshold(polyThresh)
            .setUnclipRatio(unclipRatio)
            .setMaxCandidates(maxCandidates);

    CV_Assert(!recModelPath.empty());
    TextRecognitionModel recognizer(recModelPath);

    // Load vocabulary
    CV_Assert(!vocPath.empty());
    std::ifstream vocFile;
    vocFile.open(samples::findFile(vocPath));
    CV_Assert(vocFile.is_open());
    String vocLine;
    std::vector<String> vocabulary;
    while (std::getline(vocFile, vocLine)) {
        vocabulary.push_back(vocLine);
    }
    recognizer.setVocabulary(vocabulary);
    recognizer.setDecodeType("CTC-greedy");

    // Parameters for Detection
    double detScale = 1.0 / 255.0;
    Size detInputSize = Size(width, height);
    Scalar detMean = Scalar(122.67891434, 116.66876762, 104.00698793);
    detector.setInputParams(detScale, detInputSize, detMean);

    // Parameters for Recognition
    double recScale = 1.0 / 127.5;
    Scalar recMean = Scalar(127.5);
    Size recInputSize = Size(100, 32);
    recognizer.setInputParams(recScale, recInputSize, recMean);

    // Create a window
    static const std::string winName = "Text_Spotting";

    // Input data
    Mat frame = imread(samples::findFile(parser.get<String>("inputImage")));
    std::cout << frame.size << std::endl;

    // Inference
    std::vector< std::vector<Point> > detResults;
    detector.detect(frame, detResults);

    if (detResults.size() > 0) {
        // Text Recognition
        Mat recInput;
        if (!imreadRGB) {
            cvtColor(frame, recInput, cv::COLOR_BGR2GRAY);
        } else {
            recInput = frame;
        }
        std::vector< std::vector<Point> > contours;
        for (uint i = 0; i < detResults.size(); i++)
        {
            const auto& quadrangle = detResults[i];
            CV_CheckEQ(quadrangle.size(), (size_t)4, "");

            contours.emplace_back(quadrangle);

            std::vector<Point2f> quadrangle_2f;
            for (int j = 0; j < 4; j++)
                quadrangle_2f.emplace_back(quadrangle[j]);

            // Transform and Crop
            Mat cropped;
            fourPointsTransform(recInput, &quadrangle_2f[0], cropped);

            std::string recognitionResult = recognizer.recognize(cropped);
            std::cout << i << ": '" << recognitionResult << "'" << std::endl;

            putText(frame, recognitionResult, quadrangle[3], FONT_HERSHEY_SIMPLEX, 1, Scalar(0, 0, 255), 2);
        }
        polylines(frame, contours, true, Scalar(0, 255, 0), 2);
    } else {
        std::cout << "No Text Detected." << std::endl;
    }
    imshow(winName, frame);
    waitKey();

    return 0;
}

void fourPointsTransform(const Mat& frame, const Point2f vertices[], Mat& result)
{
    const Size outputSize = Size(100, 32);

    // Map the quadrangle (bl, tl, tr, br) onto the output rectangle
    Point2f targetVertices[4] = {
        Point(0, outputSize.height - 1),
        Point(0, 0),
        Point(outputSize.width - 1, 0),
        Point(outputSize.width - 1, outputSize.height - 1)
    };
    Mat rotationMatrix = getPerspectiveTransform(vertices, targetVertices);

    warpPerspective(frame, result, rotationMatrix, outputSize);

#if 0
    imshow("roi", result);
    waitKey();
#endif
}

bool sortPts(const Point& p1, const Point& p2)
{
    return p1.x < p2.x;
}