I have installed on my RedHat machine:
(py36_maw) [rvp@lib-archcoll box]$ tesseract -v
tesseract 4.1.0
leptonica-1.78.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libopenjp2 2.3.1
Found SSE
Going by what docs I could find, I ran this to produce PDF output:
(py36_maw) [rvp@lib-archcoll box]$ time tesseract test.jp2 out -l eng PDF
read_params_file: Can't open PDF
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 275
That takes 10 seconds and produces a file out.txt in which the OCR-to-text conversion looks fine.
However, it tries to read a file called PDF, and I cannot figure out how to get PDF output.
I have read various docs. The most promising ones advise editing the config file, but the only docs that look relevant (found by googling 'tesseract 4.1 config') list many 'config' variable names for older versions of tesseract. None of them seems to indicate that I can request PDF output, much less for tesseract 4.1 specifically.
How can I invoke tesseract 4.1 (using libopenjp2 2.3.1) via CLI to produce pdf output from my jp2 input file? Bonus question: how can I get it to produce both txt and pdf output in one run?
Robert
After more surfing and digging, here are the steps that worked for me. I assume the reader has done some digging too and knows what tesseract uses TESSDATA_PREFIX for:
- Download the pdf.ttf file from: https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf
- Copy pdf.ttf to your $TESSDATA_PREFIX directory and make sure that variable is exported in your shell.
- TIP: Use command: tesseract --print-parameters # to discover defined variable names you can use in your own config file
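- Go to the directory containing test.jp2 and create a file named config with these lines:
tessedit_create_pdf 1 Write .pdf output file
tessedit_create_txt 1 Write .txt output file
(Note: you may also be able to put the config file in the TESSDATA_PREFIX directory and have it apply by default. Not tested.)
- Run in that dir:
$ tesseract test.jp2 outputbase -l eng config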
- Verify your success: it runs and produces files outputbase.txt and outputbase.pdf. The txt file looks good and the searchable pdf looks and works OK in a pdf viewer, that is, you can search and find text strings.
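The steps above can be collected into one small script. This is only a sketch: it assumes pdf.ttf is already in place under $TESSDATA_PREFIX, reuses the file names from the steps (test.jp2, outputbase, config), and writes just the parameter name and value on each config line. It skips the OCR run when tesseract or the input image is missing.

```shell
#!/bin/sh
# Sketch of the steps above; test.jp2, outputbase and config are the
# names used in the answer, so adjust them for your own setup.
set -eu

# Write a config file that switches on both output renderers.
# Each line is a parameter name followed by its value.
cat > config <<'EOF'
tessedit_create_pdf 1
tessedit_create_txt 1
EOF

# Run OCR only when tesseract and the input image are both available.
if command -v tesseract >/dev/null 2>&1 && [ -f test.jp2 ]; then
    # Check that this build knows the two parameters before relying on them.
    tesseract --print-parameters | grep tessedit_create_ || true
    # The trailing argument is the name of the config file written above.
    tesseract test.jp2 outputbase -l eng config
    ls -l outputbase.txt outputbase.pdf
else
    echo "tesseract or test.jp2 not found; wrote config only"
fi
```

As far as I can tell, tesseract treats every argument after the options as a config file name, which is why the uppercase PDF in the question led to the read_params_file message. Tesseract 4 normally ships ready-made configs named txt and pdf (lowercase) in tessdata/configs. If your installation has them, running tesseract test.jp2 outputbase -l eng txt pdf should give both outputs without a hand-written config file.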
Hope this helps someone else!