Problem description
This has been asked before, but I don't really know if the answers help me. Here is my problem: I got a bunch of (10,000 or so) pdf files. Some were text files that were saved using adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they don't have any text and I will have to settle for OCR). The files are in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them. So I want the most accurate OCR possible.
It seems people have recommended:
- adobe pdf (I don't have a licensed copy of this so ... plus if ABBYY finereader or something is better, why pay for it if I won't use it)
- ocropus (I can't figure out how to use this thing),
- Tesseract (which seems like it was great in 1995, but I'm not sure if there's something more accurate; plus it doesn't do PDFs natively and I'd have to convert to TIFF. That raises its own problem, as I don't have a licensed copy of Acrobat, so I don't know how I'd convert 10,000 files to TIFF. Plus I don't want 10,000 30 page documents turned into 30,000 individual TIFF images).
- wowocr
- pdftextstream (that was from 2009)
- ABBYY FineReader (apparently it's $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate OCR).
Also I am a n00b to programming so if it's going to take like weeks to learn how to do something, I would rather pay the $$$. Thx for input/experiences.
BTW, I'm running Linux Mint 11 64 bit and/or windows 7 64 bit.
Here are the other threads:
https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-alreadyocred
Recommended answer
Just to put some of your misconceptions straight...
"I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."
You can convert PDFs to TIFF with the help of Ghostscript, which is Free (as in liberty) and free (as in beer). It's your choice whether to do it on Linux Mint or on Windows 7. The command line for Linux is:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
input.pdf
"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"
You can have "multipage" TIFFs easily. The command above creates exactly such a TIFF, in the G4 (fax TIFF) flavor. Should you want single-page TIFFs instead, you can modify the command:
gs \
-o input_page_%03d.tif \
-sDEVICE=tiffg4 \
input.pdf
The %03d part of the output filename will automatically translate into a series of 001, 002, 003, etc.
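To preview the names that pattern yields without running Ghostscript, the same printf-style format can be tried directly in the shell. This is only an illustration: the helper name page_names is made up here, and the input_page_ prefix is simply copied from the command above.

```shell
# Print the file names that the %03d pattern produces for the given page numbers.
# page_names is a hypothetical helper, not part of Ghostscript.
page_names() {
  printf 'input_page_%03d.tif\n' "$@"
}

page_names 1 2 3
# input_page_001.tif
# input_page_002.tif
# input_page_003.tif
```

Zero-padding to three digits keeps the files in page order when they are sorted by name, as long as a document has fewer than 1000 pages.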
Caveats:
- The default resolution for the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi, add -r720x720 to the command line.
- Also, if your Ghostscript installation uses letter as its default media size, you may want to change it. You can use -gXxY to set width x height in device pixels, so the right values depend on the resolution. To get ISO A4 output page dimensions in landscape at 720 dpi, you can add a -g8420x5950 parameter.
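The -g values follow from the page size and the resolution: pixels = points * dpi / 72, since a PostScript point is 1/72 inch. A small sketch of that arithmetic, assuming A4 is taken as 595x842 points (the usual rounded figure); g_flag is a hypothetical helper name, not a Ghostscript feature:

```shell
# Turn a page size in points (1/72 inch) plus a resolution into a -g argument.
# Usage: g_flag WIDTH_PT HEIGHT_PT DPI
# Uses integer arithmetic, so any fractional pixel is truncated.
g_flag() {
  width_pt=$1
  height_pt=$2
  dpi=$3
  printf '%s\n' "-g$(( width_pt * dpi / 72 ))x$(( height_pt * dpi / 72 ))"
}

g_flag 595 842 720   # A4 portrait  -> -g5950x8420
g_flag 842 595 720   # A4 landscape -> -g8420x5950
```

If you change -r, recompute -g with the same resolution, otherwise the page will come out at the wrong physical size.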
So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
-r720x720 \
-g5950x8420 \
input.pdf
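Since the question involves about 10,000 files in one directory, the same command can be wrapped in a loop. The following is a rough sketch rather than a tested production script: the function name pdfs_to_tiff and the GS variable are assumptions introduced here, and the Ghostscript options are the ones from the command above plus -q (quiet).

```shell
#!/bin/sh
# Convert every PDF in a directory into a multipage G4 TIFF stored next to it.
# GS can be overridden (e.g. GS=echo) to print the commands instead of running them.
GS=${GS:-gs}

pdfs_to_tiff() {
  dir=$1
  for pdf in "$dir"/*.pdf; do
    # If the glob matched nothing, "$pdf" is the literal pattern; skip it.
    [ -e "$pdf" ] || continue
    tif="${pdf%.pdf}.tif"
    # Skip files that were already converted, so an interrupted run can be resumed.
    if [ -e "$tif" ]; then
      continue
    fi
    "$GS" -q -o "$tif" -sDEVICE=tiffg4 -r720x720 -g5950x8420 "$pdf"
  done
}

# Usage (path is a placeholder):
#   pdfs_to_tiff /path/to/pdfs
```

Setting GS=echo first gives a dry run that lists what would be executed, which is worth doing before letting it loose on the whole directory.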