Problem description
I am currently working on an Android project that uses Tesseract OCR. I was hoping to fine-tune the results given to the user by adding a dictionary. According to http://code.google.com/p/tesseract-ocr/wiki/FAQ , the best way to go about this would be to
However, there is no eng.user-words file in the tessdata folder, so I assume that if I just create a text file containing my dictionary, it will never be used.
Has anybody had a similar experience and knows what to do? Any advice would be a great help.
Recommended answer
If you're using Tesseract 3 (which I assume you are), you'll have to rebuild your eng.traineddata file. I intended to replace the word-dawg file completely to try to get better results (i.e., the words I'm detecting are always the same).
You'll need the combine_tessdata and wordlist2dawg executables, which are in the training directory when you compile Tesseract.
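Before starting, it can help to confirm that both tools were actually built. The following is only a sketch: check_training_tools is a hypothetical helper name, and it assumes the executables sit directly in the directory you pass in (the current directory by default).

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): report which of the two
# training tools are missing from a directory.
# Returns 0 if both are present and executable, 1 otherwise.
check_training_tools() {
    dir="${1:-.}"
    missing=0
    for tool in combine_tessdata wordlist2dawg; do
        if [ ! -x "$dir/$tool" ]; then
            echo "missing: $dir/$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_training_tools with no argument from inside the training directory checks the current directory.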
Unpack everything (I did this just to back up my eng.word-dawg; you'll also need the unicharset later):
./combine_tessdata -u eng.traineddata
Create a text file of your word list (wordlistfile).
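As far as I know, the word list is a plain UTF-8 text file with one word per line. If your words come from a messier source, a small cleanup pass like the sketch below may help. make_wordlist is a hypothetical helper name; sorting and removing duplicates is a convenience here, not something the steps in this answer require.

```shell
#!/bin/sh
# Hypothetical helper: normalize a raw word list into one word per line.
# $1 = raw input file, $2 = output file (e.g. wordlistfile).
# Strips carriage returns and surrounding whitespace, drops empty lines,
# and removes duplicate entries.
make_wordlist() {
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -v '^$' \
        | LC_ALL=C sort -u > "$2"
}
```

For example, make_wordlist raw_words.txt wordlistfile would write the cleaned list to wordlistfile, where raw_words.txt is a placeholder for wherever your words currently live.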
Create an eng.word-dawg:
./wordlist2dawg wordlistfile eng.word-dawg trainingdat_backup/.unicharset
Replace the word-dawg file:
./combine_tessdata -o eng.traineddata eng.word-dawg
That should be it.
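One precaution worth considering: the -o step modifies eng.traineddata in place, so you may want a copy of the whole file before running it, in addition to the unpacked components. A minimal sketch, where backup_once is a hypothetical helper and the .orig suffix is an arbitrary choice rather than a Tesseract convention:

```shell
#!/bin/sh
# Hypothetical helper: copy a file to <name>.orig unless that backup
# already exists, so repeated runs never clobber the first (pristine) copy.
backup_once() {
    src="$1"
    backup="$1.orig"
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    if [ -e "$backup" ]; then
        return 0   # keep the existing backup untouched
    fi
    cp -p "$src" "$backup"
}
```

Running backup_once eng.traineddata before the replace step leaves eng.traineddata.orig to fall back on if the new dictionary makes recognition worse.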