Problem description
I would like to extract text from a portion (using coordinates) of a PDF using Ghostscript.
Can anyone help me out?
Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only.
First: Ghostscript's txtwrite output device (not so good)
gs \
-dBATCH \
-dNOPAUSE \
-sDEVICE=txtwrite \
-dFirstPage=3 \
-dLastPage=5 \
-sOutputFile=- \
/path/to/your/pdf
This will output all text contained on pages 3-5 to stdout. If you want output to a text file, use -sOutputFile=textfilename.txt instead.
gs Update:
Recent versions of Ghostscript have seen major improvements and bug fixes in the txtwrite device. See the recent Ghostscript changelogs (search for txtwrite on that page) for details.
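If you run the txtwrite command often, it could be wrapped in a small shell function. This is only a sketch: the name gs_text_pages is made up for illustration, and the -q switch is my addition (it is not part of the command above). As far as I know, without -q Ghostscript prints its startup banner to stdout, where it would mix with the extracted text.

```shell
# gs_text_pages PDF FIRST LAST [OUTFILE]
# Hypothetical wrapper around the txtwrite command shown above.
# OUTFILE defaults to "-", which means stdout.
gs_text_pages() {
    if [ "$#" -lt 3 ]; then
        echo "usage: gs_text_pages PDF FIRST LAST [OUTFILE]" >&2
        return 2
    fi
    if [ "$2" -gt "$3" ]; then
        echo "gs_text_pages: first page must not be greater than last page" >&2
        return 2
    fi
    # -q keeps Ghostscript's banner out of stdout (an addition, see above)
    gs -q -dBATCH -dNOPAUSE \
       -sDEVICE=txtwrite \
       -dFirstPage="$2" \
       -dLastPage="$3" \
       -sOutputFile="${4:--}" \
       "$1"
}
```

For example, gs_text_pages /path/to/your/pdf 3 5 textfilename.txt would write the text of pages 3-5 into textfilename.txt.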
Second: Ghostscript's ps2ascii.ps PostScript utility (better)
This one requires you to download the latest version of the file ps2ascii.ps from the Ghostscript Git source code repository. You'd have to convert your PDF to PostScript, then run this command on the PS file:
gs \
-q \
-dNODISPLAY \
-P- \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
/path/to/ps2ascii.ps \
input.ps \
-c quit
If the -dSIMPLE parameter is not defined, each output line contains, beyond the pure text content, some additional information about the fonts and font sizes used.
If you replace that parameter with -dCOMPLEX, you'll get additional information about the colors and images used.
Read the comments inside ps2ascii.ps to learn more about this utility. It's not comfortable to use, but for me it worked in most cases where I needed it....
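The PDF-to-PostScript conversion mentioned above is not spelled out. One possible way, assuming your Ghostscript build includes the ps2write device, is to let Ghostscript do the conversion itself. The function name pdf_to_ps is invented for this example. How well ps2ascii.ps copes with the resulting PostScript will depend on the PDF, so treat this as a starting point only.

```shell
# pdf_to_ps INPUT.pdf [OUTPUT.ps]
# Hypothetical helper: convert a PDF to PostScript with Ghostscript's
# ps2write device. If no output name is given, a trailing ".pdf" is
# replaced by ".ps" (or ".ps" is appended if there is no such suffix).
pdf_to_ps() {
    if [ "$#" -lt 1 ]; then
        echo "usage: pdf_to_ps INPUT.pdf [OUTPUT.ps]" >&2
        return 2
    fi
    out=${2:-${1%.pdf}.ps}
    gs -q -dBATCH -dNOPAUSE \
       -sDEVICE=ps2write \
       -sOutputFile="$out" \
       "$1"
}
```

Running pdf_to_ps /path/to/your.pdf input.ps would produce the input.ps file that the ps2ascii.ps command above expects.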
Third: XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:
pdftotext \
-f 13 \
-l 17 \
-layout \
-opw supersecret \
-upw secret \
-eol unix \
-nopgbrk \
/path/to/your/pdf \
- | less
This will display pages 13 (-f, first page) through 17 (-l, last page) of a PDF file protected by two passwords (user password secret and owner password supersecret). It preserves the layout, uses the Unix EOL convention, does not insert page breaks between PDF pages, and pipes the result through less...
pdftotext -h displays all available command line options.
Of course, both tools only work for the text parts of PDFs (if they have any). Oh, and mathematical formulas won't work too well either... ;-)
pdftotext Update:
Recent versions of Poppler's pdftotext now have options to extract "a portion (using coordinates) of PDF" pages, like the OP asked for. The parameters are:
- -x <int>: top left corner's x-coordinate of crop area
- -y <int>: top left corner's y-coordinate of crop area
- -W <int>: crop area's width in pixels (defaults to 0)
- -H <int>: crop area's height in pixels (defaults to 0)
These work best when combined with the -layout parameter.
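Since the crop options take pixels, it may help to convert from a more familiar unit first. The sketch below assumes that the pixel values are measured at the resolution given by pdftotext's -r option, and that this defaults to 72 DPI. That is my understanding of Poppler's pdftotext, but -r is not mentioned above, so please check pdftotext -h on your system. The function names mm_to_px and pdftotext_region are made up for illustration.

```shell
# mm_to_px MM DPI
# Convert millimetres to whole pixels at the given resolution,
# rounded to the nearest integer (25.4 mm per inch).
mm_to_px() {
    echo $(( ($1 * $2 * 10 + 127) / 254 ))
}

# pdftotext_region PDF PAGE X_MM Y_MM W_MM H_MM [DPI]
# Hypothetical helper: print the text inside a rectangle on one page.
# X_MM and Y_MM locate the rectangle's top left corner.
pdftotext_region() {
    if [ "$#" -lt 6 ]; then
        echo "usage: pdftotext_region PDF PAGE X_MM Y_MM W_MM H_MM [DPI]" >&2
        return 2
    fi
    dpi=${7:-72}   # assumed default resolution of pdftotext
    pdftotext -f "$2" -l "$2" \
        -r "$dpi" \
        -x "$(mm_to_px "$3" "$dpi")" \
        -y "$(mm_to_px "$4" "$dpi")" \
        -W "$(mm_to_px "$5" "$dpi")" \
        -H "$(mm_to_px "$6" "$dpi")" \
        -layout \
        "$1" -
}
```

For example, pdftotext_region /path/to/your/pdf 2 25 25 100 50 would ask for the text in a 100 mm by 50 mm box on page 2 whose top left corner sits 25 mm from the left and 25 mm from the top of the crop coordinate system.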
Fourth: MuPDF's mutool draw command can also extract text
The cross-platform, open source MuPDF application (made by the same company that also develops Ghostscript) comes bundled with a command line tool, mutool. To extract text from a PDF with this tool, use:
mutool draw -F txt the.pdf
This will emit the extracted text to stdout. Use -o filename.txt to write it into a file.
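To limit mutool to certain pages, my understanding is that mutool draw accepts a page list (such as 3-5 or 1,4-6) after the file name. A small wrapper could look like the following sketch. The name mutool_text is hypothetical, and the page-list argument is an assumption worth checking against the usage text that mutool draw prints when run without arguments.

```shell
# mutool_text PDF PAGES [OUTFILE]
# Hypothetical wrapper: PAGES is a page list such as "3-5" or "1,4-6".
# Without OUTFILE the text goes to stdout.
mutool_text() {
    if [ "$#" -lt 2 ]; then
        echo "usage: mutool_text PDF PAGES [OUTFILE]" >&2
        return 2
    fi
    if [ -n "${3:-}" ]; then
        mutool draw -F txt -o "$3" "$1" "$2"
    else
        mutool draw -F txt "$1" "$2"
    fi
}
```

For example, mutool_text the.pdf 3-5 filename.txt would write the text of pages 3 to 5 into filename.txt.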
Fifth: PDFLib's Text Extraction Toolkit (TET) (best of all... but it is PayWare)
TET, the Text Extraction Toolkit from the pdflib family of products can find the x-y-coordinate of text content in a PDF file (and much more). TET has a commandline interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...) Quote from their website:
In my experience, it does not sport the most straightforward CLI you can imagine, but once you get used to it, it does what it promises to do for most PDFs you throw at it...
And there are even more options:
- podofotxtextract (CLI tool) from the PoDoFo project (Open Source)
- calibre (normally a GUI program to handle eBooks, Open Source) has a command line option that can extract text from PDFs
- AbiWord (a GUI word processor, Open Source) can import PDFs and save its files as .txt: abiword --to=txt --to-name=output.txt input.pdf