How to scrape tables from thousands of PDF files?

Problem description

I have about 1,500 PDFs, each consisting of a single page and all exhibiting the same structure (see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf for an example).

What I am looking for is a way to iterate over all these files (locally, if possible) and extract the actual contents of the table (as CSV, stored into a SQLite DB, whatever).

I would love to do this in Node.js, but couldn't find any suitable libraries for parsing such stuff. Do you know of any?

If this isn't possible in Node.js, I could also code it in Python, if better methods are available there.

Recommended answer

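I didn't know this before, but less has this magical ability to read PDF files (presumably through an input preprocessor such as lesspipe, so it may depend on how less is set up on your system). I was able to extract the table data from your example PDF with this script: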

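import re
import subprocess

# Relies on less being able to render the PDF as plain text on this system.
# text=True makes check_output return str instead of bytes (Python 3.7+).
output = subprocess.check_output(["less", "BAG_15m_kzh_2012_de.pdf"], text=True)

# Data rows start with a number followed by a dot.
re_data_prefix = re.compile(r"^[0-9]+[.].*$")
# A field is a run of words separated by single spaces;
# two or more consecutive spaces end the field.
re_data_fields = re.compile(r"(([^ ]+[ ]?)+)")
for line in output.splitlines():
    if re_data_prefix.match(line):
        print([l[0].strip() for l in re_data_fields.findall(line)])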
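
To cover the "iterate over all files and write a CSV" part of the question, the same parsing could be wrapped in a couple of functions. This is only a minimal sketch: the function names (`parse_rows`, `pdf_to_text`, `export_all`) are made up for illustration, it assumes all PDFs sit in one local directory, and it assumes less renders each of them as text the same way it did for the example file.

```python
import csv
import re
import subprocess
from pathlib import Path

# Same patterns as in the script above.
RE_DATA_PREFIX = re.compile(r"^[0-9]+[.].*$")
RE_DATA_FIELDS = re.compile(r"(([^ ]+[ ]?)+)")


def parse_rows(text):
    """Return one list of fields for every line that starts with '<number>.'."""
    rows = []
    for line in text.splitlines():
        if RE_DATA_PREFIX.match(line):
            rows.append([m[0].strip() for m in RE_DATA_FIELDS.findall(line)])
    return rows


def pdf_to_text(pdf_path):
    """Render one PDF as text via less (assumes less can read PDFs here)."""
    return subprocess.check_output(["less", str(pdf_path)], text=True)


def export_all(pdf_dir, csv_path):
    """Write the rows of every *.pdf in pdf_dir to a single CSV file.

    Each CSV row is prefixed with the source file name so that rows can be
    traced back to the PDF they came from.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            for row in parse_rows(pdf_to_text(pdf_path)):
                writer.writerow([pdf_path.name] + row)
```

Calling `export_all("pdfs", "tables.csv")` would then process every PDF in a hypothetical `pdfs` directory. Keeping `parse_rows` separate from the subprocess call makes it easy to check the regexes against a few sample lines before running the whole batch.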
