Problem description
I have an application that runs OCR on technical datasheets using the tesseract API. I initialize it like this:
tesseract::TessBaseAPI tess;
tess.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY);
However, even with custom whitelists like this:
tess.SetVariable("tessedit_char_blacklist", "");
tess.SetVariable("tessedit_char_whitelist", myWhitelist);
some datasheet entries are recognized incorrectly; for example, PA3 is recognized as FAB.
How can I disable dictionary-assisted OCR? To avoid affecting other tools, I would prefer not to modify global config files if possible.
Note: This is not a duplicate of this previous question, because that question explicitly asks about the command-line tool, while I am explicitly asking about the tesseract API.
Recommended answer
You can do it in the following way:
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
// Init returns non-zero on failure
if (api->Init(NULL, "eng"))
{
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
// SetVariable returns false if the parameter name is not found
if (!api->SetVariable("tessedit_enable_doc_dict", "0"))
{
    cout << "Unable to disable dictionary" << endl;
}
Simply pass "tessedit_enable_doc_dict" as the parameter name to the SetVariable function, together with its corresponding boolean value ("0" to disable it).
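If several parameters need to be set this way, the SetVariable call can be wrapped so that rejected names are reported together. The following is only a sketch: applyParams and FakeApi are hypothetical names introduced here for illustration and are not part of tesseract. It assumes nothing beyond an object that offers bool SetVariable(const char*, const char*), which tesseract::TessBaseAPI does.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of tesseract): calls api.SetVariable(name, value)
// for each pair and returns the names that the API rejected.
// Api is any type with `bool SetVariable(const char*, const char*)`.
template <typename Api>
std::vector<std::string> applyParams(
    Api& api,
    const std::vector<std::pair<std::string, std::string>>& params) {
    std::vector<std::string> rejected;
    for (const auto& p : params) {
        if (!api.SetVariable(p.first.c_str(), p.second.c_str())) {
            rejected.push_back(p.first);
        }
    }
    return rejected;
}

// Stand-in used only to exercise the helper without linking tesseract;
// it accepts a single known parameter name and rejects everything else.
struct FakeApi {
    std::string docDict = "1";
    bool SetVariable(const char* name, const char* value) {
        if (std::string(name) == "tessedit_enable_doc_dict") {
            docDict = value;
            return true;
        }
        return false;  // unknown name
    }
};
```

With a real API object the call would look like applyParams(*api, {{"tessedit_enable_doc_dict", "0"}}); an empty result means every name was accepted.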
I found it in the tesseractclass.h header file (line 839): https://tesseract-ocr.github.io/a00736_source.html. I think the best way to find the correct parameters is to look at the ones defined there, in the header file corresponding to your version (mine is 3.04). I had tried a few that I found on the internet before, but they didn't work. This was the configuration that worked for me.
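Looking through the header by hand can be tedious, so here is a small sketch that lists the parameter names declared in a header's text. It assumes the parameters are declared with macros of the form BOOL_VAR_H(name, default, "description"), plus the INT_, STRING_ and double_ variants; check this against the header for your version. listParamNames is a hypothetical helper name, not a tesseract function.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: scans header text for declarations such as
//   BOOL_VAR_H(tessedit_enable_doc_dict, true, "...");
// and returns the parameter names in the order they appear.
// Assumption: parameters are declared with *_VAR_H macros.
std::vector<std::string> listParamNames(const std::string& headerText) {
    static const std::regex decl(
        R"((?:BOOL|INT|STRING|double)_VAR_H\(\s*(\w+)\s*,)");
    std::vector<std::string> names;
    for (std::sregex_iterator it(headerText.begin(), headerText.end(), decl), end;
         it != end; ++it) {
        names.push_back((*it)[1].str());  // capture group 1 is the name
    }
    return names;
}
```

The text can be read from a local copy of tesseractclass.h (for example with std::ifstream); the path depends on your installation. Each returned name is a candidate for SetVariable.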