Problem description
Are there any open source libraries that support table identification & extraction?
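What I mean by this is:
1. Identify that a table structure exists
2. Classify the table based on its contents
3. Extract data from the table in a useful output format, e.g. JSON/CSV etc.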
I have looked through similar questions on this topic and found the following:
- PDFMiner, which addresses problem 3, but it seems the user has to tell PDFMiner where each table is located (correct me if I'm wrong)
- pdf-table-extract, which attempts to address problem 1 but, according to its To-Do list, cannot currently identify tables that are separated by whitespace. This is a problem, as all tables in my PDFs are separated by whitespace!
Currently, I think I would have to spend a lot of time developing a machine learning solution to identify table structures in PDFs. Any alternative approaches would therefore be more than welcome!
Recommended answer
You should definitely have a look at this answer of mine:
and also have a look at all the links included therein.
Tabula/TabulaPDF is currently the best table extraction tool available for PDF scraping.
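For the whitespace-separated case raised in the question, a simple heuristic can sometimes stand in for a machine learning approach once the page text has already been extracted with layout preserved. The following is only a minimal sketch under that assumption, not how Tabula or any other tool works internally: the function names (`split_columns`, `find_tables`, `to_csv`), the two-space column gap, and the sample text are all hypothetical choices made for illustration.

```python
import csv
import io
import re


def split_columns(line, min_gap=2):
    """Split one text line into cells at runs of at least min_gap spaces."""
    return [c for c in re.split(r" {%d,}" % min_gap, line.strip()) if c]


def find_tables(lines, min_cols=2, min_rows=2):
    """Group consecutive lines that split into the same number of cells.

    A run of at least min_rows such lines, each with at least min_cols
    cells, is treated as one table (a heuristic, not a guarantee).
    """
    tables, current = [], []
    for line in lines:
        cells = split_columns(line)
        if len(cells) >= min_cols and (not current or len(cells) == len(current[0])):
            current.append(cells)
        else:
            if len(current) >= min_rows:
                tables.append(current)
            # A line with a different column count may start a new table.
            current = [cells] if len(cells) >= min_cols else []
    if len(current) >= min_rows:
        tables.append(current)
    return tables


def to_csv(table):
    """Serialize one detected table (a list of rows) as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()


# Hypothetical layout-preserving text, as a PDF text extractor might emit it.
page_text = """Quarterly report
Name      Qty   Price
Widget    10    2.50
Gadget    3     10.00

Some closing paragraph of prose."""

tables = find_tables(page_text.splitlines())
for table in tables:
    print(to_csv(table))
```

This covers problems 1 and 3 only in the easy case: it breaks on cells that contain wide gaps themselves, on empty cells, and on columns that are aligned by position rather than by a consistent cell count, which is where a dedicated tool earns its keep.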