This post covers how to extract and identify tables in PDFs with Python.



Are there any open-source libraries that support table identification and extraction from PDFs? Specifically, I need to:


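  1. Identify that a table structure exists
  2. Classify the table based on its contents
  3. Extract data from the table in a useful output format, e.g. JSON/CSV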
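
To make the target of problem 3 concrete, here is a minimal sketch of the output step using only the Python standard library. It assumes a table has already been extracted into a header list and a list of rows; the function names and the sample data are hypothetical and only illustrate the two formats.

```python
import csv
import io
import json


def table_to_csv(header, rows):
    """Serialize a header and data rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def table_to_json(header, rows):
    """Serialize the same table as a JSON array of objects keyed by column name."""
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, ensure_ascii=False)


# Hypothetical table contents, used only for illustration.
header = ["item", "qty"]
rows = [["apple", "3"], ["pear", "5"]]
csv_text = table_to_csv(header, rows)
json_text = table_to_json(header, rows)
```

CSV keeps the row/column layout as-is, while the JSON form repeats the column names in every record, which is usually easier to consume downstream.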


I have looked through similar questions on this topic and found the following:

  • PDFMiner addresses problem 3, but it seems the user has to tell PDFMiner where each table is located (correct me if I'm wrong)
  • pdf-table-extract attempts to address problem 1, but according to its To-Do list it cannot currently identify tables that are separated by whitespace. This is a problem, as all the tables in my PDFs are separated by whitespace!
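
For whitespace-separated tables, a simple heuristic may be worth trying before any Machine Learning approach. The sketch below is not how PDFMiner or pdf-table-extract work; it assumes the page text has already been extracted with its horizontal spacing preserved, treats two or more consecutive spaces as a column gap, and reports runs of consecutive lines that have the same number of columns. The names `split_columns` and `find_tables` and the thresholds are hypothetical.

```python
import re

# Assumption: two or more consecutive whitespace characters mark a column gap.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(line):
    """Split one text line into cells on runs of two or more spaces."""
    stripped = line.strip()
    return COLUMN_GAP.split(stripped) if stripped else []


def find_tables(lines, min_rows=2, min_cols=2):
    """Return runs of consecutive lines that share the same column count."""
    tables, current = [], []
    for line in lines:
        cells = split_columns(line)
        same_width = not current or len(cells) == len(current[0])
        if len(cells) >= min_cols and same_width:
            current.append(cells)
            continue
        # The run ended: keep it only if it is long enough to be a table.
        if len(current) >= min_rows:
            tables.append(current)
        # A row-like line with a different width starts a new candidate run.
        current = [cells] if len(cells) >= min_cols else []
    if len(current) >= min_rows:
        tables.append(current)
    return tables


# Hypothetical page text, used only for illustration.
sample_lines = [
    "Quarterly report",
    "",
    "Region    Sales    Units",
    "North     1200     30",
    "South     950      21",
    "",
    "End of report",
]
tables = find_tables(sample_lines)
```

This will misfire on justified prose, on cells that contain double spaces, and on tables with empty cells, so it is a starting point for problem 1 rather than a general solution.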


Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!



You should definitely have a look at this answer of mine:


and also have a look at all the links included therein.


Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.
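
From Python, Tabula is usually driven through the third-party `tabula-py` wrapper, which requires a Java runtime. The following is a minimal sketch rather than a definitive recipe: the function name `extract_tables` and its `whitespace_separated` parameter are hypothetical, and the choice of stream mode is an assumption based on the whitespace-separated tables described in the question.

```python
def extract_tables(pdf_path, pages="all", whitespace_separated=True):
    """Return a list of pandas DataFrames, one per table tabula-py detects."""
    # Imported lazily so this module still loads where tabula-py is absent.
    import tabula  # third-party: pip install tabula-py (also needs Java)

    return tabula.read_pdf(
        pdf_path,
        pages=pages,
        multiple_tables=True,
        # Stream mode infers columns from whitespace between text chunks.
        stream=whitespace_separated,
        # Lattice mode relies on ruling lines drawn between cells.
        lattice=not whitespace_separated,
    )


# Example usage (the file name is a placeholder):
# for frame in extract_tables("report.pdf"):
#     print(frame.to_csv(index=False))
```

Each returned DataFrame can be written out with `to_csv` or `to_json`, which covers the output-format requirement; classifying tables by content would still need separate logic.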

