Extracting table contents from a collection of PDF files

I have a stack of PDFs, potentially hundreds or thousands. They are not all formatted the same way, but any of them may contain one or more tables with interesting information that I would like to collect into a separate database.

Of course, I know I have to write something to do this. Perl is an option for me, or perhaps Java. I don't really care what language, so long as it's free (or cheap, with a free trial period to ensure it suits my purposes).

I'm looking at CAM::Parse (using Strawberry Perl), but I'm not sure how to use it to locate and extract tables from the files. I do have a preference for Perl, but what I really want is something that works dependably and makes string manipulation reasonably easy.

What is a good approach for something like this? I'm at square one, so if Java (or Python, etc.) has better hooks, now is a good time to know about it. General pointers are good; starter code would be strongly preferred.

Solution

From its inception (more than 20 years ago), the PDF format was never intended to host extractable, meaningfully structured data.

Its purpose was to be a reliable visual representation of the text, images and diagrams in a document: a kind of digital paper (one that would also transfer reliably to real paper via printing). Only later in its development were features added that help with getting data back out (search for "Tagged PDF").

For some examples of the problems posed by scraping tables from PDFs, see the article "Why Updating Dollars for Docs Was So Difficult".

Contradicting my first point above, I now say this: for an amazing family of tools for extracting tabular data from PDFs (unless they are scanned pages), one that gets better from week to week, see these links:

- Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
- Tabula-Extractor: A Command Line Interface to Tabula
- Tabula source code repository
- Tabula API (upcoming, not ready yet)

So: go look for Tabula. If any tool can do what you want, Tabula is at this time probably among the best for the job!

Update

I've recently created an asciinema screencast demonstrating the use of the Tabula command line interface to extract a big table from a PDF as CSV. If it runs too fast for you to read all the text, use the "Pause" button (the || symbol). It is hosted here: https://asciinema.org/a/22761
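Since the question asks for starter code for a whole stack of PDFs, here is a minimal Python sketch of how the Tabula command line could be driven in a batch. It is not taken from the answer. It assumes the Java-based Tabula command-line jar (tabula-java) with its `-p` (pages), `-f` (format) and `-o` (output file) options, and it assumes `java` is on the PATH. The jar path, the directory names and the helper names `build_tabula_command` and `extract_tables` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tabula_command(jar_path, pdf_path, csv_path):
    """Build the argument list for one Tabula run (nothing is executed here)."""
    return [
        "java", "-jar", str(jar_path),  # assumed: a local tabula-java jar
        "-p", "all",                    # look for tables on every page
        "-f", "CSV",                    # write the result as CSV
        "-o", str(csv_path),            # output file for this PDF
        str(pdf_path),
    ]


def extract_tables(pdf_dir, out_dir, jar_path, runner=subprocess.run):
    """Run Tabula once per PDF in pdf_dir and return the PDFs whose run failed."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        csv_path = out_dir / (pdf_path.stem + ".csv")
        command = build_tabula_command(jar_path, pdf_path, csv_path)
        # A non-zero exit code is treated as "could not process this file".
        result = runner(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

A call such as `extract_tables("pdfs", "csv_out", "tabula.jar")` would write one CSV per PDF into `csv_out` and return the files Tabula could not process. Because the PDFs are not all formatted the same way, the resulting CSVs would still need checking, and scanned pages are not covered, as noted above.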