Extracting and identifying tables from PDFs with Python

Question

Are there any open source libraries that support table identification & extraction?

By this I mean:

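  1. Identify that a table structure exists
  2. Classify the table based on its contents
  3. Extract the data from the table in a useful output format, e.g. JSON/CSV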
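
To make these three steps concrete, here is a rough sketch that works on lines of text already extracted from a PDF page. It is only a heuristic under stated assumptions: columns are taken to be separated by two or more spaces, and the names `find_tables`, `classify`, `to_csv` and `to_json` are hypothetical helpers, not part of any library mentioned here. Real PDFs will usually need something more robust.

```python
import csv
import io
import json
import re

# Assumption: a column boundary is a run of two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_row(line):
    """Split one text line into cells on runs of 2+ whitespace characters."""
    return COLUMN_GAP.split(line.strip())


def find_tables(lines, min_rows=2):
    """Step 1: group consecutive lines that split into the same number (>1) of cells."""
    tables, current = [], []
    for line in lines:
        cells = split_row(line) if line.strip() else []
        if len(cells) > 1 and (not current or len(cells) == len(current[0])):
            current.append(cells)
            continue
        # The run of matching rows has ended; keep it if it is long enough.
        if len(current) >= min_rows:
            tables.append(current)
        # Start a new candidate if this line itself looks like a row.
        current = [cells] if len(cells) > 1 else []
    if len(current) >= min_rows:
        tables.append(current)
    return tables


def classify(table):
    """Step 2 (very rough): label a table by whether its body cells are mostly numeric."""
    body = [cell for row in table[1:] for cell in row]
    numeric = sum(1 for cell in body if re.fullmatch(r"-?\d+(\.\d+)?", cell))
    return "numeric" if body and numeric * 2 >= len(body) else "text"


def to_csv(table):
    """Step 3a: serialise the table rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


def to_json(table):
    """Step 3b: treat the first row as the header and emit a JSON list of records."""
    header, *rows = table
    return json.dumps([dict(zip(header, row)) for row in rows])
```

For example, `find_tables(page_text.splitlines())` would return a list of candidate tables, each of which can then be passed to `classify`, `to_csv` or `to_json`; `page_text` stands for whatever text your PDF extractor produced.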

I have looked through similar questions on this topic and found the following:

  • PDFMiner, which addresses problem 3, but it seems the user has to tell PDFMiner where each table is located (correct me if I'm wrong)
  • pdf-table-extract, which attempts to address problem 1 but, according to its To-Do list, cannot currently identify tables that are separated by whitespace. This is a problem, as all the tables in my PDFs are separated by whitespace!

Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!

Recommended answer

You should definitely have a look at this answer of mine:

and also have a look at all the links included therein.

Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.
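
As a minimal sketch of how Tabula might be driven from Python, the snippet below assumes the separate `tabula-py` wrapper package (not mentioned in the answer itself; it needs a Java runtime) together with pandas. The helper names `extract_tables` and `tables_to_json` are hypothetical. Passing `stream=True` asks Tabula to infer columns from whitespace rather than from ruling lines, which is the case described in the question.

```python
import json


def extract_tables(pdf_path, pages="all"):
    """Return every table Tabula finds in pdf_path as a list of record lists."""
    # Imported lazily so this module still loads where tabula-py is not installed.
    import tabula

    # read_pdf returns a list of pandas DataFrames, one per detected table.
    frames = tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True, stream=True)
    return [frame.to_dict(orient="records") for frame in frames]


def tables_to_json(tables):
    """Serialise the extracted tables as a JSON string."""
    # default=str is a fallback for cell values that json cannot encode directly.
    return json.dumps(tables, ensure_ascii=False, indent=2, default=str)
```

A call such as `print(tables_to_json(extract_tables("report.pdf")))` would then dump all detected tables as JSON, where `report.pdf` is a placeholder file name. Whether the detected tables are correct depends on the PDF, so the output is worth checking by eye.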

08-12 11:39