Running tesseract 4.1 with openjpeg2 - can't produce pdf output


Problem description

I have installed on my RedHat machine:

(py36_maw) [rvp@lib-archcoll box]$ tesseract -v
tesseract 4.1.0
 leptonica-1.78.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libopenjp2 2.3.1
 Found SSE

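Following what documentation I could find, I tried this to produce pdf output:

(py36_maw) [rvp@lib-archcoll box]$ time tesseract test.jp2 out -l eng PDF
read_params_file: Can't open PDF
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 275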

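That takes 10 seconds and produces a file out.txt containing a good OCR text conversion.

However, tesseract tries to read a file called PDF, and I cannot figure out how to get PDF output.

I have read various docs. The most promising advice is to edit a config file, but the only docs that look relevant, found by googling 'tesseract 4.1 config', list many config variable names for older versions of tesseract. None of them seems to let me request pdf output, much less specifically for tesseract 4.1.

How can I invoke tesseract 4.1 (using libopenjp2 2.3.1) via the CLI to produce pdf output from my jp2 input file? Bonus question: how can I get it to produce both txt and pdf output in one run?

Robert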

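Solution

After more surfing and digging, here are the steps that worked for me. I assume the reader has done some digging too and knows what tesseract uses TESSDATA_PREFIX for.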

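Download the pdf.ttf file from: https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf
Copy pdf.ttf to your $TESSDATA_PREFIX directory and make sure that variable is exported in your shell.
TIP: Run tesseract --print-parameters to discover the defined variable names you can use in your own config file.
Go to the directory containing test.jp2 and create a file named config with these lines: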

tessedit_create_pdf 1    Write .pdf output file
tessedit_create_txt 1    Write .txt output file

(Note: you may also be able to put the config file in the TESSDATA_PREFIX directory so that it is always the default. Not tested.)
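The setup steps above can be sketched as a small shell snippet. This is only an illustration, assuming a POSIX shell and that pdf.ttf has already been downloaded and copied by hand. The sketch keeps just the parameter name and value on each config line and drops the trailing description text, on the assumption that name and value are all tesseract needs.

```shell
#!/bin/sh
# Sanity-check the environment described in the steps above.
# This only warns; it does not download or copy pdf.ttf for you.
if [ -z "${TESSDATA_PREFIX:-}" ]; then
    echo "warning: TESSDATA_PREFIX is not set or not exported" >&2
elif [ ! -f "$TESSDATA_PREFIX/pdf.ttf" ]; then
    echo "warning: pdf.ttf not found in $TESSDATA_PREFIX" >&2
fi

# Write the two-line config file into the current directory,
# i.e. the directory that holds test.jp2.
# The file name "config" follows the answer; any name works as long as
# the same name is passed on the tesseract command line.
cat > config <<'EOF'
tessedit_create_pdf 1
tessedit_create_txt 1
EOF
```

Run it from the directory that contains test.jp2, so that the config file ends up next to the input image.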

Run in that directory:

$ tesseract test.jp2 outputbase -l eng config

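Verify your success: the command runs and produces the files outputbase.txt and outputbase.pdf. The txt file looks good, and the searchable pdf displays and works correctly in a pdf viewer, that is, you can search for and find text strings.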
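The run-and-verify step can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name run_ocr is made up for illustration, and the file names (test.jp2, outputbase, config) are the ones used in this answer. It only checks that both output files exist and are non-empty, so opening the pdf in a viewer to confirm it is searchable is still a manual step.

```shell
#!/bin/sh
# Run tesseract with the local config file, then check that both
# output files were written and are non-empty.
run_ocr() {
    tesseract test.jp2 outputbase -l eng config || return 1
    for ext in txt pdf; do
        if [ ! -s "outputbase.$ext" ]; then
            echo "missing or empty: outputbase.$ext" >&2
            return 1
        fi
    done
    echo "ok: outputbase.txt and outputbase.pdf written"
}

# Only attempt the run when tesseract is on PATH and the input exists.
if command -v tesseract >/dev/null 2>&1 && [ -f test.jp2 ]; then
    run_ocr
else
    echo "skipped: tesseract or test.jp2 not available" >&2
fi
```

As an aside, stock tesseract installs usually include ready-made config files named txt and pdf (lowercase) under tessdata/configs. Where those are present, passing txt pdf in place of config should give the same two outputs, and since file names are case-sensitive on Linux, that would be consistent with the "Can't open PDF" message for the uppercase name. That variant is not what this answer tested.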

Hope this helps someone else!

