Problem description
A friend of mine who is doing an internship asked me two hours ago whether I could help him avoid manually converting 462 PDF files to .xls using free online software.
I thought of a shell script using unoconv, but I couldn't work out how to use it properly, and I am not sure unoconv can solve this problem, since it mainly converts files to PDF, not the other way round.
Conversion from PDF to any other structured format is not always possible and not generally recommended.
Having said that, this does look like a one-off job, and there are a fair few of them (462).
It's worth pursuing if you can reliably extract text from most of them and that text is reasonably structured. It's a matter of getting regular text output across a sample of the PDFs that you can reliably parse into a table structure.
There are plenty of tools around that target either direct or OCR-based text extraction; just google around.
One I like is pstotext, which uses Ghostscript; its -bboxes option gives me the coordinates of each word and leaves it up to me to reassemble the structure. Despite its name, it does work on PDF input. The downside is that it can be a bit flaky: it works on some PDFs but not on others.
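As a rough illustration of that reassembly step, here is a minimal Python sketch that groups words into table rows by their vertical position. It assumes each line of the word dump looks like `llx lly urx ury word` (four numbers, then the word) and that y grows upwards as in PDF coordinates; check this against the real output of your tool before relying on it. The function names and the `y_tolerance` parameter are made up for this example.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (left x, bottom y, text)


def parse_bbox_lines(lines: List[str]) -> List[Word]:
    """Parse lines ASSUMED to look like 'llx lly urx ury word'."""
    words = []
    for line in lines:
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        try:
            llx, lly = float(parts[0]), float(parts[1])
        except ValueError:
            continue  # skip lines that do not start with numbers
        words.append((llx, lly, parts[4].strip()))
    return words


def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[List[str]]:
    """Cluster words with nearly equal y into rows, top of page first."""
    rows = []  # each entry: [y of the row's first word, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda w: -w[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each row, order the words left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


sample = [
    "10 700 40 712 Name",
    "100 700 130 712 Qty",
    "10 680 40 692 Apple",
    "100 681 110 692 3",
]
print(group_into_rows(parse_bbox_lines(sample)))
```

The tolerance matters because words on the same visual line rarely share exactly the same y value; it will need tuning per batch of PDFs.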
If you get this far, you'd then most likely need to write a shell script or program to convert that output to CSV. You can either open the CSV directly in a spreadsheet or look for a tool to convert it to XLS.
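The CSV step could look something like the sketch below, written in Python rather than shell. It assumes you already have, for each table row, a list of (x, text) pairs, and that you have worked out by inspection the x position where each column starts; `column_starts`, `row_to_cells` and `write_csv` are hypothetical names for this example only.

```python
import csv
import io
from bisect import bisect_right
from typing import Iterable, List, Sequence, TextIO, Tuple


def row_to_cells(words: Sequence[Tuple[float, str]],
                 column_starts: Sequence[float]) -> List[str]:
    """Put each (x, text) word in the last column whose start is <= x."""
    cells = [[] for _ in column_starts]
    for x, text in sorted(words):
        # column_starts must be sorted ascending for bisect to work.
        index = max(bisect_right(column_starts, x) - 1, 0)
        cells[index].append(text)
    return [" ".join(parts) for parts in cells]


def write_csv(rows: Iterable[Sequence[Tuple[float, str]]],
              column_starts: Sequence[float], out: TextIO) -> None:
    """Write one CSV line per row; the csv module handles quoting."""
    writer = csv.writer(out)
    for words in rows:
        writer.writerow(row_to_cells(words, column_starts))


rows = [
    [(10, "Item"), (40, "name"), (100, "Qty")],
    [(10, "Apple,"), (45, "red"), (100, "3")],
]
buffer = io.StringIO()
write_csv(rows, [0, 90], buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO`, open it with `newline=""` as the `csv` module documentation recommends. Using the `csv` module rather than joining strings by hand means cells that contain commas or quotes are escaped correctly.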
PS: If he hasn't already, get the intern to ask whether there's any possible way of getting at the original data that was used to create the PDFs. It will save a lot of time and effort and lead to a far more accurate result.
Update: An alternative to pstotext is the renderpdf.pl command, which is included in the Perl CAM::PDF module. It is more robust, but it only reports the (x,y) position of the text, not bounding boxes.