How to convert PDF to Excel or CSV in Rails 4

Question

I have searched a lot and have no option left but to ask here. Do you know of an online converter with an API, or a gem, that can convert a PDF to an Excel or CSV file?

I am not sure whether this is the best place to ask, either.

My application is on Rails 4.2. The PDF file contains a header and a big table with about 10 columns.

More info: the user uploads the PDF via a form; I then need to grab the PDF, parse it to CSV, and read the content. I tried reading the content with the PDF Reader gem, but the result wasn't really promising.

I have tried freepdfconvert.com/pdf-excel. Unfortunately they don't provide an API (I have contacted them).

Sample PDF

This piece of code converts the PDF into text, which is handy. Gem: pdf-reader

  require 'pdf-reader'

  # Print the extracted text of every page in the uploaded PDF.
  def self.parse
    reader = PDF::Reader.new("pdf_uploaded_by_user.pdf")
    reader.pages.each do |page|
      puts page.text
    end
  end

Now, if you check the attached sample PDF, you will see that some fields may be empty. That means I can't simply split each text line on spaces and put the pieces in an array, because I wouldn't be able to map the array elements to the correct fields.
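To make the problem concrete, here is a minimal sketch in plain Ruby. The two lines of text and the column offsets are made up for illustration; they are not taken from the sample PDF. The sketch shows how a whitespace split loses an empty field, and how slicing at fixed character offsets would keep it, assuming the extracted text keeps its columns aligned (which pdf-reader does not guarantee).

```ruby
# Hypothetical extracted lines: three columns padded to fixed widths.
# The second row has an empty middle ("City") field.
full   = format("%-10s%-10s%s", "Alice", "Paris", "100")
sparse = format("%-10s%-10s%s", "Bob", "", "250")

# Naive approach: splitting on whitespace silently drops the empty column.
naive_full   = full.split    # three values, one per column
naive_sparse = sparse.split  # only two values; "250" slides into the City slot

# Alternative sketch: cut each line at assumed character offsets.
column_ranges = [0...10, 10...20, 20..-1]

def slice_columns(line, ranges)
  # line[range] is nil when the line is shorter than the range start.
  ranges.map { |range| line[range].to_s.strip }
end

sliced_sparse = slice_columns(sparse, column_ranges)

p naive_sparse
p sliced_sparse
```

With real PDFs the offsets would have to be derived per document, for example from the header row, so this is only practical when the layout is stable.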

Thank you.

Solution

OK, after a lot of research I couldn't find an API, or even a proper piece of software, that does this directly. Here is how I did it.

I first extract the table from the PDF as an HTML table using the pdftables API. It is cheap.

Then I convert the HTML table to CSV.

(This is not ideal but it works)

Here is the code:

require 'httmultiparty'
require 'nokogiri'
require 'csv'

class PageTextReceiver
  include HTTMultiParty
  # Not used by the request below, which passes an absolute URL.
  base_uri 'http://localhost:3000'

  # Upload the PDF to pdftables and save the returned HTML to disk.
  def run
    response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })

    File.open('/path/to/save/as/html/response.html', 'w') do |f|
      f.puts response.body
    end
  end

  # Read the saved HTML and write one CSV row per table row.
  def convert
    doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }
    # Keyword options (rather than a trailing hash) also work on Ruby 3.
    CSV.open("path/to/csv/t.csv", 'w', col_sep: ",", quote_char: "'", force_quotes: true) do |csv|
      doc.xpath('//table/tr').each do |row|
        csv << row.xpath('td').map(&:text)
      end
    end
  end
end

Now run it like this:

#> page = PageTextReceiver.new
#> page.run
#> page.convert

It is not refactored; it is just a proof of concept. You need to consider performance.

I might use Sidekiq to run it in the background and pass the result back to the main thread.
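As a side note on the CSV options used in `convert`, here is a small sketch using only Ruby's standard `csv` library. It shows what `quote_char` and `force_quotes` do to the output. The row values are invented for illustration. Anything reading the file back needs the same `quote_char`.

```ruby
require 'csv'

# Made-up rows, including an embedded single quote and an empty field.
rows = [["Name", "Note", "Extra"], ["Bob", "it's fine", ""]]

# Same options as in the convert method above.
csv_text = CSV.generate(col_sep: ",", quote_char: "'", force_quotes: true) do |csv|
  rows.each { |row| csv << row }
end

# Every field is wrapped in single quotes; an embedded single quote is doubled.
puts csv_text

# Reading it back requires the same quote character.
parsed = CSV.parse(csv_text, col_sep: ",", quote_char: "'")
p parsed
```

Because spreadsheet tools usually expect double quotes, dropping the `quote_char` option may be the simpler choice if the CSV is meant to be opened in Excel.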
