Tesseract custom dictionary


Question

I am currently working on an Android project that uses Tesseract OCR. I was hoping to fine-tune the results given to the user by adding a dictionary. According to http://code.google.com/p/tesseract-ocr/wiki/FAQ , the best way to go about this would be to

However, there is no eng.user-words file in the tessdata folder, so I assume that if I just make a text file with my dictionary in it, it will never be used.

Has anybody had a similar experience and know what to do? Any advice would be a great help.

Recommended answer

If you're using Tesseract 3 (which I assume you are), you'll have to rebuild your eng.traineddata file. I intended to replace the word-dawg file completely to try to get better results (i.e. the words I'm detecting are always the same).

You'll need the combine_tessdata and wordlist2dawg executables, which end up in the training directory when you compile Tesseract.

Unpack everything (I did this just to back up my eng.word-dawg; you'll also need the unicharset later):

./combine_tessdata -u eng.traineddata trainingdat_backup/.

Create a text file containing your word list (wordlistfile).
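As far as I know, wordlist2dawg expects a plain-text list with one word per line. Here is a minimal sketch of preparing such a file; raw_words.txt and the sample words are hypothetical placeholders for your own vocabulary.

```shell
#!/bin/sh
# Hypothetical raw vocabulary, including a blank line and a duplicate entry.
printf '%s\n' 'invoice' 'total' 'subtotal' '' 'total' > raw_words.txt

# Strip trailing whitespace (including stray carriage returns), drop empty
# lines, and de-duplicate, writing one word per line to wordlistfile.
sed -e 's/[[:space:]]*$//' raw_words.txt | grep -v '^$' | LC_ALL=C sort -u > wordlistfile
```

Every character in the list should be one that the language's unicharset already knows, since the word list is compiled against that unicharset in the next step.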

Create an eng.word-dawg:

./wordlist2dawg wordlistfile eng.word-dawg trainingdat_backup/.unicharset

Replace the word-dawg file:

./combine_tessdata -o eng.traineddata eng.word-dawg

That should be it.
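The steps above can be collected into one small shell function. This is only a sketch under a few assumptions: the TOOLS variable (the directory holding the compiled training tools) and the extra backup copy of eng.traineddata are my own additions, and I'm assuming combine_tessdata -u takes an output path prefix as its last argument, which is why the unicharset ends up at trainingdat_backup/.unicharset.

```shell
#!/bin/sh
# Sketch only: TOOLS and the .orig backup copy are illustrative assumptions,
# not part of the original answer.
rebuild_word_dawg() {
    tools=${TOOLS:-.}
    backup=trainingdat_backup

    # Keep an untouched copy, because the last step overwrites eng.traineddata.
    mkdir -p "$backup" || return 1
    cp eng.traineddata "$backup/eng.traineddata.orig" || return 1

    # 1. Unpack all components; the final argument is treated as a path
    #    prefix, so the unicharset is written to "$backup/.unicharset".
    "$tools/combine_tessdata" -u eng.traineddata "$backup/." || return 1

    # 2. Compile the plain-text word list into a DAWG using that unicharset.
    "$tools/wordlist2dawg" wordlistfile eng.word-dawg "$backup/.unicharset" || return 1

    # 3. Overwrite the word-dawg component inside eng.traineddata.
    "$tools/combine_tessdata" -o eng.traineddata eng.word-dawg
}
```

Call rebuild_word_dawg from the directory that contains eng.traineddata and wordlistfile, with TOOLS set to wherever your build placed the two executables. The function stops at the first failing step, so a bad word list never reaches the combine step.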
