Problem description
I want to read a specific character sequence with Tesseract, as in this post: Tesseract OCR: is it possible to force a specific pattern?
I tried the bazaar pattern matching in Tesseract with the pattern \d\d\d\A\A, but the OCR still recognizes other words that do not match.
I also tried the "tessedit_char_whitelist" parameter, but it does not let me choose the position of the characters.
- I run the command:
tesseract image.jpg result -l eng bazaar
and I get this message:
Invalid user pattern \A\A\d\d\d
Tesseract Open Source OCR Engine v3.01 with Leptonica
- image.jpg: [image not shown]
- The result:
AB123
ABC12
A1234
12345
ABCD1
So this is wrong: I only wanted to capture the sequence "AB123".
Can somebody tell me why the regular expression in my user-patterns file has no effect? For the configuration, I strictly followed the bazaar tutorial.
Recommended answer
Try using this pattern with quantifiers instead:
[a-zA-Z]{2}\d{3}
This should match only 2 alphabetic characters followed by 3 digits.
The reason you were matching everything before is that \w is alphanumeric: it matches letters and digits alike.
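To see what the quantifier pattern accepts, here is a minimal Python sketch that applies it to the five result lines from the question. This is ordinary regex post-filtering of the OCR text, not Tesseract's own user-patterns mechanism, and the helper name filter_ocr_lines is hypothetical.

```python
import re

# Pattern from the answer: exactly two letters followed by exactly three digits.
PATTERN = re.compile(r"[a-zA-Z]{2}\d{3}")


def filter_ocr_lines(lines):
    """Keep only the lines that match the whole pattern (hypothetical helper)."""
    kept = []
    for line in lines:
        text = line.strip()  # drop surrounding whitespace and newlines
        # fullmatch requires the entire string to fit the pattern,
        # so longer or shorter strings are rejected.
        if PATTERN.fullmatch(text):
            kept.append(text)
    return kept


if __name__ == "__main__":
    # The five lines Tesseract returned in the question.
    ocr_lines = ["AB123", "ABC12", "A1234", "12345", "ABCD1"]
    print(filter_ocr_lines(ocr_lines))  # ['AB123']
```

Using fullmatch rather than search matters here: search would also accept a longer string that merely contains two letters followed by three digits.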