Problem description
I have searched a lot and have no choice but to ask here. Does anyone know of an online converter with an API, or a gem, that can convert a PDF to an Excel or CSV file?
I am not sure whether this is the best place to ask, either.
My application is on Rails 4.2. The PDF file contains a header and a big table with about 10 columns.
More info: the user uploads the PDF via a form, then I need to grab the PDF, parse it to CSV, and read the content. I tried to read the content with the PDF Reader gem, but the result wasn't really promising.
I have used freepdfconvert.com/pdf-excel. Unfortunately they don't supply an API (I have contacted them).
Sample PDF
This piece of code converts the PDF into text, which is handy. Gem: pdf-reader
require 'pdf/reader'

# Print the extracted text of every page in the uploaded PDF.
def self.parse
  reader = PDF::Reader.new("pdf_uploaded_by_user.pdf")
  reader.pages.each do |page|
    puts page.text
  end
end
Now, if you check the attached sample PDF, you will see that some fields may be empty. That means I can't simply split each line of text on spaces and put the pieces in an array, because I won't be able to map the array to the correct fields.
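A small illustration of the problem, using made-up lines of text rather than the actual PDF output: when a cell is empty, splitting on whitespace yields fewer values than there are columns, so everything after the gap shifts left. The column names and values here are hypothetical.

```ruby
# Hypothetical column names and text lines; not taken from the sample PDF.
columns = %w[code name quantity price]

full_line    = "A100  Widget  3   9.99"
missing_line = "A101  Gadget      4.50" # the quantity cell is empty

# Pair each column name with the value at the same position.
full_h    = columns.zip(full_line.split).to_h
missing_h = columns.zip(missing_line.split).to_h

p full_h    # every column gets its own value
p missing_h # "4.50" lands under quantity and price becomes nil
```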
Thank you.
OK, after lots of research I couldn't find an API, or even proper software, that does it. Here is how I did it.
I first extract the table from the PDF into an HTML table with the pdftables API. It is cheap.
Then I convert the HTML table to CSV.
(This is not ideal but it works)
Here is the code:
require 'httmultiparty'
require 'nokogiri'
require 'csv'

class PageTextReceiver
  include HTTMultiParty
  base_uri 'http://localhost:3000'

  # Upload the PDF to pdftables and save the HTML response to disk.
  def run
    response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })
    File.open('/path/to/save/as/html/response.html', 'w') do |f|
      f.puts response
    end
  end

  # Read the saved HTML and write each table row as one CSV line.
  def convert
    # The block form closes the file once Nokogiri has parsed it.
    doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }
    csv = CSV.open("path/to/csv/t.csv", 'w', col_sep: ",", quote_char: "'", force_quotes: true)
    doc.xpath('//table/tr').each do |row|
      tarray = []
      row.xpath('td').each do |cell|
        tarray << cell.text
      end
      csv << tarray
    end
    csv.close
  end
end
Now run it like this:
#> page = PageTextReceiver.new
#> page.run
#> page.convert
It is not refactored; it is just a proof of concept. You need to consider performance.
I might use Sidekiq to run it in the background and move the result to the main thread.
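For reference, here is a minimal sketch of what the CSV options used in convert produce, using only Ruby's standard csv library. The rows are made-up sample data, not taken from the actual PDF, and the helper name rows_to_csv is hypothetical.

```ruby
require 'csv'

# Hypothetical helper: serialize rows (arrays of cell strings) with the same
# options that convert passes to CSV.open.
def rows_to_csv(rows)
  CSV.generate(col_sep: ",", quote_char: "'", force_quotes: true) do |csv|
    rows.each { |row| csv << row }
  end
end

# Made-up sample rows; the second one has an empty cell and an apostrophe.
rows = [
  ["Name", "Qty", "Note"],
  ["O'Brien", "", "ok"]
]

output = rows_to_csv(rows)
puts output

# Reading the text back with the same quote character recovers three cells
# per row, so the empty cell keeps its position.
parsed = CSV.parse(output, col_sep: ",", quote_char: "'")
```

Because every cell is quoted and written in its own position, an empty table cell stays an empty CSV field instead of collapsing the way a whitespace split does. Note that a single quote as quote_char means apostrophes inside cells are doubled in the output.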