Merge pull request #17570 from HannibalAPE:text_det_recog_demo

[GSoC] High Level API and Samples for Scene Text Detection and Recognition

* APIs and samples for scene text detection and recognition
* update APIs and tutorial for Text Detection and Recognition
* API updates: (1) put decodeType into struct Voc (2) optimize the post-processing of DB
* sample update: (1) add transformation into scene_text_spotting.cpp (2) modify text_detection.cpp with API update
* update tutorial
* simplify text recognition API, update tutorial
* update impl usage in recognize() and detect()
* dnn: refactoring public API of TextRecognitionModel/TextDetectionModel
* update provided models, update opencv.bib
* dnn: adjust text rectangle angle
* remove points ordering operation in model.cpp
* update gts of DB test in test_model.cpp
* dnn: ensure to keep text rectangle angle - avoid 90/180 degree turns
* dnn(text): use quadrangle result in TextDetectionModel API
* dnn: update Text Detection API (1) keep points' order consistent with (bl, tl, tr, br) in unclip (2) update contourScore with boundingRect

parent 5ecf693774
commit 22d64ae08f

19 changed files with 2340 additions and 182 deletions
doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown
@@ -0,0 +1,316 @@

# High Level API: TextDetectionModel and TextRecognitionModel {#tutorial_dnn_text_spotting}

@prev_tutorial{tutorial_dnn_OCR}

## Introduction

In this tutorial, we will introduce the APIs for TextRecognitionModel and TextDetectionModel in detail.

---

#### TextRecognitionModel:

In the current version, @ref cv::dnn::TextRecognitionModel only supports CNN+RNN+CTC based algorithms,
and only the greedy decoding method for CTC is provided.
For more information, please refer to the [original paper](https://arxiv.org/abs/1507.05717).

Before recognition, you should call `setVocabulary` and `setDecodeType`.
- "CTC-greedy": the output of the text recognition model should be a probability matrix (see the decoding sketch below).
  The shape should be `(T, B, Dim)`, where
  - `T` is the sequence length,
  - `B` is the batch size (only `B=1` is supported for inference),
  - and `Dim` is the vocabulary length + 1 (the 'Blank' token of CTC is at index 0 of `Dim`).
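
To make the decoding concrete, here is a minimal sketch (not the OpenCV-internal implementation) of how greedy CTC decoding turns such a probability matrix into a string: take the argmax class at each timestep, collapse consecutive repeats, and drop the blank at index 0. The `probs` layout (`T x Dim` single-channel float matrix, batch already squeezed out) and the `vocabulary` argument are assumptions for illustration.

```cpp
#include <opencv2/core.hpp>
#include <string>
#include <vector>

// Greedy CTC decoding sketch: probs is a T x Dim float matrix (B = 1 squeezed out).
// Class 0 is the CTC blank; vocabulary[k] holds the character for class k + 1.
std::string ctcGreedyDecode(const cv::Mat& probs, const std::vector<std::string>& vocabulary)
{
    std::string result;
    int prevClass = 0; // start from blank so the first character is always emitted
    for (int t = 0; t < probs.rows; t++)
    {
        cv::Point maxLoc;
        cv::minMaxLoc(probs.row(t), nullptr, nullptr, nullptr, &maxLoc);
        int cls = maxLoc.x; // argmax over Dim
        if (cls != 0 && cls != prevClass) // skip blanks and collapse repeats
            result += vocabulary[cls - 1];
        prevClass = cls;
    }
    return result;
}
```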

@ref cv::dnn::TextRecognitionModel::recognize() is the main function for text recognition.
- The input image should be a cropped text image, or an image together with `roiRects` (see the batch sketch below).
- Other decoding methods may be supported in the future.
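
The `roiRects` mode can be sketched as follows; `model` is assumed to be a configured TextRecognitionModel, while `frame` and the detected quadrangles `detResults` are assumed to come from a text detector as in the detection example later in this tutorial. The batch overload fills one result string per region:

```cpp
// Batch recognition: one output string per 4-point region in roiRects.
std::vector<std::vector<cv::Point>> roiRects = detResults; // quadrangles from detection
std::vector<std::string> results;
model.recognize(frame, roiRects, results);
for (size_t i = 0; i < results.size(); i++)
    std::cout << i << ": '" << results[i] << "'" << std::endl;
```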

---

#### TextDetectionModel:

@ref cv::dnn::TextDetectionModel API provides these methods for text detection:
- cv::dnn::TextDetectionModel::detect() returns the results as `std::vector<std::vector<Point>>` (4-point quadrangles)
- cv::dnn::TextDetectionModel::detectTextRectangles() returns the results as `std::vector<cv::RotatedRect>` (RBOX-like); a short sketch follows below
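
A minimal sketch of the rotated-rectangle variant, assuming `model` and `frame` are set up as in the detection example later in this tutorial (the overload with a confidence output is part of the same API):

```cpp
// Rotated-rectangle detection: each result carries center, size and angle.
std::vector<cv::RotatedRect> rects;
std::vector<float> confidences;
model.detectTextRectangles(frame, rects, confidences);

for (size_t i = 0; i < rects.size(); i++)
{
    cv::Point2f corners[4];
    rects[i].points(corners); // expand the RotatedRect into its 4 corners
    for (int j = 0; j < 4; j++)
        cv::line(frame, corners[j], corners[(j + 1) % 4], cv::Scalar(0, 255, 0), 2);
}
```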

In the current version, @ref cv::dnn::TextDetectionModel supports these algorithms:
- use @ref cv::dnn::TextDetectionModel_DB with "DB" models
- and use @ref cv::dnn::TextDetectionModel_EAST with "EAST" models

The pretrained models provided below are variants of DB (without deformable convolution);
their performance is reported in Table 1 of the [paper](https://arxiv.org/abs/1911.08947).
For more information, please refer to the [official code](https://github.com/MhLiao/DB).

---

You can train your own model with more data and convert it into ONNX format.
We encourage you to add new algorithms to these APIs.

## Pretrained Models

#### TextRecognitionModel:

```
crnn.onnx:
url: https://drive.google.com/uc?export=dowload&id=1ooaLR-rkTl8jdpGy1DoQs0-X0lQsB6Fj
sha: 270d92c9ccb670ada2459a25977e8deeaf8380d3
alphabet_36.txt: https://drive.google.com/uc?export=dowload&id=1oPOYx5rQRp8L6XQciUwmwhMCfX0KyO4b
parameter setting: -rgb=0;
description: The classification number of this model is 36 (0~9 + a~z).
             The training dataset is MJSynth.

crnn_cs.onnx:
url: https://drive.google.com/uc?export=dowload&id=12diBsVJrS9ZEl6BNUiRp9s0xPALBS7kt
sha: a641e9c57a5147546f7a2dbea4fd322b47197cd5
alphabet_94.txt: https://drive.google.com/uc?export=dowload&id=1oKXxXKusquimp7XY1mFvj9nwLzldVgBR
parameter setting: -rgb=1;
description: The classification number of this model is 94 (0~9 + a~z + A~Z + punctuation).
             The training datasets are MJSynth and SynthText.

crnn_cs_CN.onnx:
url: https://drive.google.com/uc?export=dowload&id=1is4eYEUKH7HR7Gl37Sw4WPXx6Ir8oQEG
sha: 3940942b85761c7f240494cf662dcbf05dc00d14
alphabet_3944.txt: https://drive.google.com/uc?export=dowload&id=18IZUUdNzJ44heWTndDO6NNfIpJMmN-ul
parameter setting: -rgb=1;
description: The classification number of this model is 3944 (0~9 + a~z + A~Z + Chinese characters + special characters).
             The training dataset is ReCTS (https://rrc.cvc.uab.es/?ch=12).
```

More models can be found [here](https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing),
which are taken from [clovaai](https://github.com/clovaai/deep-text-recognition-benchmark).
You can train more models with [CRNN](https://github.com/meijieru/crnn.pytorch), and convert them with `torch.onnx.export`.

#### TextDetectionModel:

```
- DB_IC15_resnet50.onnx:
  url: https://drive.google.com/uc?export=dowload&id=17_ABp79PlFt9yPCxSaarVc_DKTmrSGGf
  sha: bef233c28947ef6ec8c663d20a2b326302421fa3
  recommended parameter setting: -inputHeight=736, -inputWidth=1280;
  description: This model is trained on ICDAR2015, so it can only detect English text instances.

- DB_IC15_resnet18.onnx:
  url: https://drive.google.com/uc?export=dowload&id=1sZszH3pEt8hliyBlTmB-iulxHP1dCQWV
  sha: 19543ce09b2efd35f49705c235cc46d0e22df30b
  recommended parameter setting: -inputHeight=736, -inputWidth=1280;
  description: This model is trained on ICDAR2015, so it can only detect English text instances.

- DB_TD500_resnet50.onnx:
  url: https://drive.google.com/uc?export=dowload&id=19YWhArrNccaoSza0CfkXlA8im4-lAGsR
  sha: 1b4dd21a6baa5e3523156776970895bd3db6960a
  recommended parameter setting: -inputHeight=736, -inputWidth=736;
  description: This model is trained on MSRA-TD500, so it can detect both English and Chinese text instances.

- DB_TD500_resnet18.onnx:
  url: https://drive.google.com/uc?export=dowload&id=1vY_KsDZZZb_svd5RT6pjyI8BS1nPbBSX
  sha: 8a3700bdc13e00336a815fc7afff5dcc1ce08546
  recommended parameter setting: -inputHeight=736, -inputWidth=736;
  description: This model is trained on MSRA-TD500, so it can detect both English and Chinese text instances.
```

We will release more models of DB [here](https://drive.google.com/drive/folders/1qzNCHfUJOS0NEUOIKn69eCtxdlNPpWbq?usp=sharing) in the future.

```
- EAST:
  Download link: https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1
  This model is based on https://github.com/argman/EAST
```

## Images for Testing

```
Text Recognition:
url: https://drive.google.com/uc?export=dowload&id=1nMcEy68zDNpIlqAn6xCk_kYcUTIeSOtN
sha: 89205612ce8dd2251effa16609342b69bff67ca3

Text Detection:
url: https://drive.google.com/uc?export=dowload&id=149tAhIcvfCYeyufRoZ9tmc2mZDKE_XrF
sha: ced3c03fb7f8d9608169a913acf7e7b93e07109b
```

## Example for Text Recognition

Step1. Loading images and models with a vocabulary

```cpp
// Load a cropped text line image
// you can find cropped images for testing in "Images for Testing"
int rgb = IMREAD_COLOR; // This should be changed according to the model input requirement.
Mat image = imread("path/to/text_rec_test.png", rgb);

// Load model weights
TextRecognitionModel model("path/to/crnn_cs.onnx");

// The decoding method
// more methods will be supported in future
model.setDecodeType("CTC-greedy");

// Load vocabulary
// vocabulary should be changed according to the text recognition model
std::ifstream vocFile;
vocFile.open("path/to/alphabet_94.txt");
CV_Assert(vocFile.is_open());
String vocLine;
std::vector<String> vocabulary;
while (std::getline(vocFile, vocLine)) {
    vocabulary.push_back(vocLine);
}
model.setVocabulary(vocabulary);
```

Step2. Setting Parameters

```cpp
// Normalization parameters
double scale = 1.0 / 127.5;
Scalar mean = Scalar(127.5, 127.5, 127.5);

// The input shape
Size inputSize = Size(100, 32);

model.setInputParams(scale, inputSize, mean);
```
Step3. Inference
```cpp
std::string recognitionResult = model.recognize(image);
std::cout << "'" << recognitionResult << "'" << std::endl;
```

Input image:

![Picture example](text_rec_test.png)

Output:
```
'welcome'
```

## Example for Text Detection

Step1. Loading images and models
```cpp
// Load an image
// you can find some images for testing in "Images for Testing"
Mat frame = imread("/path/to/text_det_test.png");
```

Step2.a Setting Parameters (DB)
```cpp
// Load model weights
TextDetectionModel_DB model("/path/to/DB_TD500_resnet50.onnx");

// Post-processing parameters
float binThresh = 0.3;
float polyThresh = 0.5;
uint maxCandidates = 200;
double unclipRatio = 2.0;
model.setBinaryThreshold(binThresh)
     .setPolygonThreshold(polyThresh)
     .setMaxCandidates(maxCandidates)
     .setUnclipRatio(unclipRatio);

// Normalization parameters
double scale = 1.0 / 255.0;
Scalar mean = Scalar(122.67891434, 116.66876762, 104.00698793);

// The input shape
Size inputSize = Size(736, 736);

model.setInputParams(scale, inputSize, mean);
```

Step2.b Setting Parameters (EAST)
```cpp
TextDetectionModel_EAST model("EAST.pb");

float confThresh = 0.5;
float nmsThresh = 0.4;
model.setConfidenceThreshold(confThresh)
     .setNMSThreshold(nmsThresh);

double detScale = 1.0;
Size detInputSize = Size(320, 320);
Scalar detMean = Scalar(123.68, 116.78, 103.94);
bool swapRB = true;
model.setInputParams(detScale, detInputSize, detMean, swapRB);
```

Step3. Inference
```cpp
std::vector<std::vector<Point>> detResults;
model.detect(frame, detResults);

// Visualization
polylines(frame, detResults, true, Scalar(0, 255, 0), 2);
imshow("Text Detection", frame);
waitKey();
```

Output:

![Picture example](text_det_test_results.jpg)

## Example for Text Spotting

After following the steps above, it is easy to get the detection results of an input image.
Then, you can apply a perspective transformation and crop each text region for recognition.
For more information, please refer to **Detailed Sample**.
```cpp
// Transform and Crop
Mat cropped;
fourPointsTransform(recInput, vertices, cropped);

String recResult = recognizer.recognize(cropped);
```
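
The `fourPointsTransform` helper used above is not an OpenCV function; a minimal sketch, matching the one in the detailed samples below, warps a 4-point quadrangle ordered (bottom-left, top-left, top-right, bottom-right) onto the recognizer's fixed 100x32 input:

```cpp
void fourPointsTransform(const Mat& frame, const Point2f vertices[], Mat& result)
{
    const Size outputSize = Size(100, 32);

    // Target corners in (bl, tl, tr, br) order, matching the detector's output order
    Point2f targetVertices[4] = {
        Point(0, outputSize.height - 1),
        Point(0, 0),
        Point(outputSize.width - 1, 0),
        Point(outputSize.width - 1, outputSize.height - 1)
    };
    Mat rotationMatrix = getPerspectiveTransform(vertices, targetVertices);
    warpPerspective(frame, result, rotationMatrix, outputSize);
}
```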

Output Examples:

![Picture example](detect_test1.jpg)

![Picture example](detect_test2.jpg)

## Source Code
The [source code](https://github.com/opencv/opencv/blob/master/modules/dnn/src/model.cpp)
of these APIs can be found in the DNN module.

## Detailed Sample
For more information, please refer to:
- [samples/dnn/scene_text_recognition.cpp](https://github.com/opencv/opencv/blob/master/samples/dnn/scene_text_recognition.cpp)
- [samples/dnn/scene_text_detection.cpp](https://github.com/opencv/opencv/blob/master/samples/dnn/scene_text_detection.cpp)
- [samples/dnn/text_detection.cpp](https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.cpp)
- [samples/dnn/scene_text_spotting.cpp](https://github.com/opencv/opencv/blob/master/samples/dnn/scene_text_spotting.cpp)

#### Test with an image
Examples:
```bash
example_dnn_scene_text_recognition -mp=path/to/crnn_cs.onnx -i=path/to/an/image -rgb=1 -vp=/path/to/alphabet_94.txt
example_dnn_scene_text_detection -mp=path/to/DB_TD500_resnet50.onnx -i=path/to/an/image -ih=736 -iw=736
example_dnn_scene_text_spotting -dmp=path/to/DB_IC15_resnet50.onnx -rmp=path/to/crnn_cs.onnx -i=path/to/an/image -iw=1280 -ih=736 -rgb=1 -vp=/path/to/alphabet_94.txt
example_dnn_text_detection -dmp=path/to/EAST.pb -rmp=path/to/crnn_cs.onnx -i=path/to/an/image -rgb=1 -vp=path/to/alphabet_94.txt
```

#### Test on public datasets
Text Recognition:

The download link for testing images can be found in **Images for Testing**.

Examples:
```bash
example_dnn_scene_text_recognition -mp=path/to/crnn.onnx -e=true -edp=path/to/evaluation_data_rec -vp=/path/to/alphabet_36.txt -rgb=0
example_dnn_scene_text_recognition -mp=path/to/crnn_cs.onnx -e=true -edp=path/to/evaluation_data_rec -vp=/path/to/alphabet_94.txt -rgb=1
```

Text Detection:

The download links for testing images can be found in **Images for Testing**.

Examples:
```bash
example_dnn_scene_text_detection -mp=path/to/DB_TD500_resnet50.onnx -e=true -edp=path/to/evaluation_data_det/TD500 -ih=736 -iw=736
example_dnn_scene_text_detection -mp=path/to/DB_IC15_resnet50.onnx -e=true -edp=path/to/evaluation_data_det/IC15 -ih=736 -iw=1280
```
alphabet_36.txt
@@ -0,0 +1,36 @@
0
1
2
3
4
5
6
7
8
9
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
alphabet_94.txt
@@ -0,0 +1,94 @@
0
1
2
3
4
5
6
7
8
9
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
:
;
<
=
>
?
@
[
\
]
^
_
`
{
|
}
~
samples/dnn/scene_text_detection.cpp
@@ -0,0 +1,151 @@
#include <iostream>
#include <fstream>
#include <regex>

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/dnn/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

std::string keys =
    "{ help h | | Print help message. }"
    "{ inputImage i | | Path to an input image. Skip this argument to capture frames from a camera. }"
    "{ modelPath mp | | Path to a binary .onnx file containing the trained DB detector model. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ inputHeight ih |736| image height of the model input. It should be a multiple of 32.}"
    "{ inputWidth iw |736| image width of the model input. It should be a multiple of 32.}"
    "{ binaryThreshold bt |0.3| Confidence threshold of the binary map. }"
    "{ polygonThreshold pt |0.5| Confidence threshold of polygons. }"
    "{ maxCandidate max |200| Max candidates of polygons. }"
    "{ unclipRatio ratio |2.0| unclip ratio. }"
    "{ evaluate e |false| false: predict with input images; true: evaluate on benchmarks. }"
    "{ evalDataPath edp | | Path to benchmarks for evaluation. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}";

int main(int argc, char** argv)
{
    // Parse arguments
    CommandLineParser parser(argc, argv, keys);
    parser.about("Use this script to run the official PyTorch implementation (https://github.com/MhLiao/DB) of "
                 "Real-time Scene Text Detection with Differentiable Binarization (https://arxiv.org/abs/1911.08947)\n"
                 "The current version of this script is a variant of the original network without deformable convolution");
    if (argc == 1 || parser.has("help"))
    {
        parser.printMessage();
        return 0;
    }

    float binThresh = parser.get<float>("binaryThreshold");
    float polyThresh = parser.get<float>("polygonThreshold");
    uint maxCandidates = parser.get<uint>("maxCandidate");
    String modelPath = parser.get<String>("modelPath");
    double unclipRatio = parser.get<double>("unclipRatio");
    int height = parser.get<int>("inputHeight");
    int width = parser.get<int>("inputWidth");

    if (!parser.check())
    {
        parser.printErrors();
        return 1;
    }

    // Load the network
    CV_Assert(!modelPath.empty());
    TextDetectionModel_DB detector(modelPath);
    detector.setBinaryThreshold(binThresh)
            .setPolygonThreshold(polyThresh)
            .setUnclipRatio(unclipRatio)
            .setMaxCandidates(maxCandidates);

    double scale = 1.0 / 255.0;
    Size inputSize = Size(width, height);
    Scalar mean = Scalar(122.67891434, 116.66876762, 104.00698793);
    detector.setInputParams(scale, inputSize, mean);

    // Create a window
    static const std::string winName = "TextDetectionModel";

    if (parser.get<bool>("evaluate")) {
        // for evaluation
        String evalDataPath = parser.get<String>("evalDataPath");
        CV_Assert(!evalDataPath.empty());
        String testListPath = evalDataPath + "/test_list.txt";
        std::ifstream testList;
        testList.open(testListPath);
        CV_Assert(testList.is_open());

        // Create a window for showing groundtruth
        static const std::string winNameGT = "GT";

        String testImgPath;
        while (std::getline(testList, testImgPath)) {
            String imgPath = evalDataPath + "/test_images/" + testImgPath;
            std::cout << "Image Path: " << imgPath << std::endl;

            Mat frame = imread(samples::findFile(imgPath), IMREAD_COLOR);
            CV_Assert(!frame.empty());
            Mat src = frame.clone();

            // Inference
            std::vector<std::vector<Point>> results;
            detector.detect(frame, results);

            polylines(frame, results, true, Scalar(0, 255, 0), 2);
            imshow(winName, frame);

            // Load groundtruth: each line is "x1,y1,x2,y2,...,text"
            String imgName = testImgPath.substr(0, testImgPath.length() - 4);
            String gtPath = evalDataPath + "/test_gts/" + imgName + ".txt";
            std::ifstream gtFile;
            gtFile.open(gtPath);
            CV_Assert(gtFile.is_open());

            std::vector<std::vector<Point>> gts;
            String gtLine;
            while (std::getline(gtFile, gtLine)) {
                size_t splitLoc = gtLine.find_last_of(',');
                String text = gtLine.substr(splitLoc + 1);
                if (text == "###\r" || text == "1") {
                    // ignore difficult instances
                    continue;
                }
                gtLine = gtLine.substr(0, splitLoc);

                // Split the remaining coordinates on commas
                std::regex delimiter(",");
                std::vector<String> v(std::sregex_token_iterator(gtLine.begin(), gtLine.end(), delimiter, -1),
                                      std::sregex_token_iterator());
                std::vector<int> loc;
                std::vector<Point> pts;
                for (auto && s : v) {
                    loc.push_back(atoi(s.c_str()));
                }
                for (size_t i = 0; i < loc.size() / 2; i++) {
                    pts.push_back(Point(loc[2 * i], loc[2 * i + 1]));
                }
                gts.push_back(pts);
            }
            polylines(src, gts, true, Scalar(0, 255, 0), 2);
            imshow(winNameGT, src);

            waitKey();
        }
    } else {
        // Open an image file
        CV_Assert(parser.has("inputImage"));
        Mat frame = imread(samples::findFile(parser.get<String>("inputImage")));
        CV_Assert(!frame.empty());

        // Detect
        std::vector<std::vector<Point>> results;
        detector.detect(frame, results);

        polylines(frame, results, true, Scalar(0, 255, 0), 2);
        imshow(winName, frame);
        waitKey();
    }

    return 0;
}
samples/dnn/scene_text_recognition.cpp
@@ -0,0 +1,144 @@
#include <iostream>
#include <fstream>

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/dnn/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

String keys =
    "{ help h | | Print help message. }"
    "{ inputImage i | | Path to an input image. Skip this argument to capture frames from a camera. }"
    "{ modelPath mp | | Path to a binary .onnx file containing the trained CRNN text recognition model. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ RGBInput rgb |0| 0: imread with flags=IMREAD_GRAYSCALE; 1: imread with flags=IMREAD_COLOR. }"
    "{ evaluate e |false| false: predict with input images; true: evaluate on benchmarks. }"
    "{ evalDataPath edp | | Path to benchmarks for evaluation. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ vocabularyPath vp | alphabet_36.txt | Path to recognition vocabulary. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}";

String convertForEval(const String& input);

int main(int argc, char** argv)
{
    // Parse arguments
    CommandLineParser parser(argc, argv, keys);
    parser.about("Use this script to run the PyTorch implementation of "
                 "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition "
                 "(https://arxiv.org/abs/1507.05717)");
    if (argc == 1 || parser.has("help"))
    {
        parser.printMessage();
        return 0;
    }

    String modelPath = parser.get<String>("modelPath");
    String vocPath = parser.get<String>("vocabularyPath");
    int imreadRGB = parser.get<int>("RGBInput");

    if (!parser.check())
    {
        parser.printErrors();
        return 1;
    }

    // Load the network
    CV_Assert(!modelPath.empty());
    TextRecognitionModel recognizer(modelPath);

    // Load vocabulary
    CV_Assert(!vocPath.empty());
    std::ifstream vocFile;
    vocFile.open(samples::findFile(vocPath));
    CV_Assert(vocFile.is_open());
    String vocLine;
    std::vector<String> vocabulary;
    while (std::getline(vocFile, vocLine)) {
        vocabulary.push_back(vocLine);
    }
    recognizer.setVocabulary(vocabulary);
    recognizer.setDecodeType("CTC-greedy");

    // Set parameters
    double scale = 1.0 / 127.5;
    Scalar mean = Scalar(127.5, 127.5, 127.5);
    Size inputSize = Size(100, 32);
    recognizer.setInputParams(scale, inputSize, mean);

    if (parser.get<bool>("evaluate"))
    {
        // For evaluation: each line of test_gts.txt is "<relative image path> <groundtruth text>"
        String evalDataPath = parser.get<String>("evalDataPath");
        CV_Assert(!evalDataPath.empty());
        String gtPath = evalDataPath + "/test_gts.txt";
        std::ifstream evalGts;
        evalGts.open(gtPath);
        CV_Assert(evalGts.is_open());

        String gtLine;
        int cntRight = 0, cntAll = 0;
        TickMeter timer;
        timer.reset();

        while (std::getline(evalGts, gtLine)) {
            size_t splitLoc = gtLine.find_first_of(' ');
            String imgPath = evalDataPath + '/' + gtLine.substr(0, splitLoc);
            String gt = gtLine.substr(splitLoc + 1);

            // Inference
            Mat frame = imread(samples::findFile(imgPath), imreadRGB);
            CV_Assert(!frame.empty());
            timer.start();
            std::string recognitionResult = recognizer.recognize(frame);
            timer.stop();

            if (gt == convertForEval(recognitionResult))
                cntRight++;

            cntAll++;
        }
        std::cout << "Accuracy(%): " << 100.0 * (double)cntRight / (double)cntAll << std::endl;
        std::cout << "Average Inference Time(ms): " << timer.getTimeMilli() / (double)cntAll << std::endl;
    }
    else
    {
        // Create a window
        static const std::string winName = "Input Cropped Image";

        // Open an image file
        CV_Assert(parser.has("inputImage"));
        Mat frame = imread(samples::findFile(parser.get<String>("inputImage")), imreadRGB);
        CV_Assert(!frame.empty());

        // Recognition
        std::string recognitionResult = recognizer.recognize(frame);

        imshow(winName, frame);
        std::cout << "Prediction: '" << recognitionResult << "'" << std::endl;
        waitKey();
    }

    return 0;
}

// Convert the prediction to lower case and drop every non-alphabetic character.
// Only for evaluation.
String convertForEval(const String& input)
{
    String output;
    for (uint i = 0; i < input.length(); i++) {
        char ch = input[i];
        if (ch >= 'a' && ch <= 'z') {
            output.push_back(ch);
        } else if (ch >= 'A' && ch <= 'Z') {
            output.push_back((char)(ch + 32));
        }
    }

    return output;
}
samples/dnn/scene_text_spotting.cpp
@@ -0,0 +1,169 @@
#include <iostream>
#include <fstream>

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/dnn/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

std::string keys =
    "{ help h | | Print help message. }"
    "{ inputImage i | | Path to an input image. Skip this argument to capture frames from a camera. }"
    "{ detModelPath dmp | | Path to a binary .onnx model for detection. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ recModelPath rmp | | Path to a binary .onnx model for recognition. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}"
    "{ inputHeight ih |736| image height of the model input. It should be a multiple of 32.}"
    "{ inputWidth iw |736| image width of the model input. It should be a multiple of 32.}"
    "{ RGBInput rgb |0| 0: imread with flags=IMREAD_GRAYSCALE; 1: imread with flags=IMREAD_COLOR. }"
    "{ binaryThreshold bt |0.3| Confidence threshold of the binary map. }"
    "{ polygonThreshold pt |0.5| Confidence threshold of polygons. }"
    "{ maxCandidate max |200| Max candidates of polygons. }"
    "{ unclipRatio ratio |2.0| unclip ratio. }"
    "{ vocabularyPath vp | alphabet_36.txt | Path to recognition vocabulary. "
        "Download links are provided in doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown}";

void fourPointsTransform(const Mat& frame, const Point2f vertices[], Mat& result);
bool sortPts(const Point& p1, const Point& p2);

int main(int argc, char** argv)
{
    // Parse arguments
    CommandLineParser parser(argc, argv, keys);
    parser.about("Use this script to run an end-to-end inference sample of textDetectionModel and textRecognitionModel APIs\n"
                 "Use -h for more information");
    if (argc == 1 || parser.has("help"))
    {
        parser.printMessage();
        return 0;
    }

    float binThresh = parser.get<float>("binaryThreshold");
    float polyThresh = parser.get<float>("polygonThreshold");
    uint maxCandidates = parser.get<uint>("maxCandidate");
    String detModelPath = parser.get<String>("detModelPath");
    String recModelPath = parser.get<String>("recModelPath");
    String vocPath = parser.get<String>("vocabularyPath");
    double unclipRatio = parser.get<double>("unclipRatio");
    int height = parser.get<int>("inputHeight");
    int width = parser.get<int>("inputWidth");
    int imreadRGB = parser.get<int>("RGBInput");

    if (!parser.check())
    {
        parser.printErrors();
        return 1;
    }

    // Load networks
    CV_Assert(!detModelPath.empty());
    TextDetectionModel_DB detector(detModelPath);
    detector.setBinaryThreshold(binThresh)
            .setPolygonThreshold(polyThresh)
            .setUnclipRatio(unclipRatio)
            .setMaxCandidates(maxCandidates);

    CV_Assert(!recModelPath.empty());
    TextRecognitionModel recognizer(recModelPath);

    // Load vocabulary
    CV_Assert(!vocPath.empty());
    std::ifstream vocFile;
    vocFile.open(samples::findFile(vocPath));
    CV_Assert(vocFile.is_open());
    String vocLine;
    std::vector<String> vocabulary;
    while (std::getline(vocFile, vocLine)) {
        vocabulary.push_back(vocLine);
    }
    recognizer.setVocabulary(vocabulary);
    recognizer.setDecodeType("CTC-greedy");

    // Parameters for Detection
    double detScale = 1.0 / 255.0;
    Size detInputSize = Size(width, height);
    Scalar detMean = Scalar(122.67891434, 116.66876762, 104.00698793);
    detector.setInputParams(detScale, detInputSize, detMean);

    // Parameters for Recognition
    double recScale = 1.0 / 127.5;
    Scalar recMean = Scalar(127.5);
    Size recInputSize = Size(100, 32);
    recognizer.setInputParams(recScale, recInputSize, recMean);

    // Create a window
    static const std::string winName = "Text_Spotting";

    // Input data
    CV_Assert(parser.has("inputImage"));
    Mat frame = imread(samples::findFile(parser.get<String>("inputImage")));
    CV_Assert(!frame.empty());
    std::cout << frame.size << std::endl;

    // Inference
    std::vector< std::vector<Point> > detResults;
    detector.detect(frame, detResults);

    if (detResults.size() > 0) {
        // Text Recognition
        Mat recInput;
        if (!imreadRGB) {
            cvtColor(frame, recInput, cv::COLOR_BGR2GRAY);
        } else {
            recInput = frame;
        }
        std::vector< std::vector<Point> > contours;
        for (uint i = 0; i < detResults.size(); i++)
        {
            const auto& quadrangle = detResults[i];
            CV_CheckEQ(quadrangle.size(), (size_t)4, "");

            contours.emplace_back(quadrangle);

            std::vector<Point2f> quadrangle_2f;
            for (int j = 0; j < 4; j++)
                quadrangle_2f.emplace_back(quadrangle[j]);

            // Transform and Crop
            Mat cropped;
            fourPointsTransform(recInput, &quadrangle_2f[0], cropped);

            std::string recognitionResult = recognizer.recognize(cropped);
            std::cout << i << ": '" << recognitionResult << "'" << std::endl;

            putText(frame, recognitionResult, quadrangle[3], FONT_HERSHEY_SIMPLEX, 1, Scalar(0, 0, 255), 2);
        }
        polylines(frame, contours, true, Scalar(0, 255, 0), 2);
    } else {
        std::cout << "No Text Detected." << std::endl;
    }
    imshow(winName, frame);
    waitKey();

    return 0;
}

void fourPointsTransform(const Mat& frame, const Point2f vertices[], Mat& result)
{
    const Size outputSize = Size(100, 32);

    // Target corners in (bl, tl, tr, br) order, matching the detector's output order
    Point2f targetVertices[4] = {
        Point(0, outputSize.height - 1),
        Point(0, 0),
        Point(outputSize.width - 1, 0),
        Point(outputSize.width - 1, outputSize.height - 1)
    };
    Mat rotationMatrix = getPerspectiveTransform(vertices, targetVertices);

    warpPerspective(frame, result, rotationMatrix, outputSize);

#if 0
    imshow("roi", result);
    waitKey();
#endif
}

// Note: currently unused; kept for sorting points by x-coordinate.
bool sortPts(const Point& p1, const Point& p2)
{
    return p1.x < p2.x;
}