How to use the Tika package (https://github.com/chrismattmann/tika-python) in Python (2.7) to parse PDF files?

Problem description

I'm trying to parse a few PDF files that contain engineering drawings to extract the text data in them. I first tried using Tika as a jar from Python through the jnius package (following this tutorial: http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html), but the code throws an error.

With the tika package, however, I can pass files in and parse them, but only the metadata is extracted: when I ask for the content, Python prints None. It parses .txt files perfectly but fails to extract content from PDFs. Here's the code:

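import tika
tika.initVM()  # start-up call used in the original attempt
from tika import parser

# The result is indexed by "metadata" and "content"
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # metadata comes back as expected
print(parsed["content"])   # prints None for the PDFs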

Do I need additional packages or lines of code to extract the data?
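
To make the symptom described above easier to report, here is a minimal sketch of a helper that summarizes a parse result. It assumes only that the result is a dictionary with "metadata" and "content" keys, as the question's code implies. The function name describe_parse_result and the two sample dictionaries are hypothetical stand-ins, not real Tika output.

```python
def describe_parse_result(parsed):
    """Summarize a dict shaped like the one indexed in the question's code."""
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content")
    if content is None:
        status = "no content returned"
    elif not content.strip():
        status = "content is empty or whitespace only"
    else:
        status = "content has {0} characters".format(len(content))
    return "{0} metadata keys; {1}".format(len(metadata), status)


# Stand-in dictionaries, made up for illustration (not real Tika output).
pdf_like = {"metadata": {"Content-Type": "application/pdf"}, "content": None}
txt_like = {"metadata": {"Content-Type": "text/plain"}, "content": "hello\n"}

print(describe_parse_result(pdf_like))
print(describe_parse_result(txt_like))
```

Running the helper on the real parsed value for a PDF and for a .txt file should show whether the two differ only in the content field.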

Recommended answer

You need to download the Tika Server jar and run it first. See this link: http://wiki.apache.org/tika/TikaJAXRS

  1. Download the jar.
  2. Store it somewhere and run it with java -jar tika-server-x.x.jar --port xxxx
  3. In your code, you no longer need to call tika.initVM(). Set tika.TikaClientOnly = True instead.
  4. Change parsed = parser.from_file('/path/to/file') to parsed = parser.from_file('/path/to/file', '/path/to/server'). The server path is shown in step 2 when the Tika server starts; just plug it in here.
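
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive setup: the helper names tika_server_command and parse_via_server are hypothetical, and the jar path and port in the example are placeholders chosen for illustration. Only the java -jar ... --port command shape, tika.TikaClientOnly, and the two-argument parser.from_file call come from the answer itself.

```python
def tika_server_command(jar_path, port):
    """Build the launch command from step 2 as an argument list."""
    return ["java", "-jar", jar_path, "--port", str(port)]


def parse_via_server(path, server_path):
    """Steps 3 and 4: parse `path` through an already-running Tika server."""
    # Imported here so the helper above can be used without tika installed.
    import tika
    tika.TikaClientOnly = True  # step 3: set this instead of calling tika.initVM()
    from tika import parser
    return parser.from_file(path, server_path)  # step 4: pass the server path


# Hypothetical jar path and port, chosen only for illustration.
command = tika_server_command("/opt/tika/tika-server.jar", 9998)
print(" ".join(command))
```

With the server running, parse_via_server('/path/to/file', '/path/to/server') would then be called with the second argument replaced by the address the server reports at startup. Its "content" entry can be checked the same way as in the question.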

Good luck!

