Problem description
I'm trying to extract data from tables inside some PDF reports.
I've seen some examples using pdftools and similar packages, and I was successful in getting the text. However, I just want to extract the tables.
Is there a way to use R to recognize and extract only tables?
Recommended answer
Awesome question, I wondered about the same thing recently, thanks!
I did it with tabulizer ‘0.2.2’, as @hrbrmstr also suggests. If you are using R > 3.5.x, I suggest the following solution. Install the three packages in this specific order:
# install.packages("rJava")
# library(rJava) # load and attach 'rJava' now
# install.packages("devtools")
# devtools::install_github("ropensci/tabulizer", args="--no-multiarch")
Update: After testing the approach again, it looks like it's now enough to just run install.packages("tabulizer"). rJava will be installed automatically as a dependency.
Now you are ready to extract tables from your PDF reports.
library(tabulizer)
## load report
l <- "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"
m <- extract_tables(l, encoding="UTF-8")[[2]] ## comes as a character matrix
## Note: peep into `?extract_tables` for further specs (page, location etc.)!
## use first row as column names
dat <- setNames(type.convert(as.data.frame(m[-1, ]), as.is=TRUE), m[1, ])
## example-specific date conversion
dat$Date <- as.POSIXlt(dat$Date, format="%m/%d/%y")
dat <- within(dat, Date$year <- ifelse(Date$year > 120, Date$year - 100, Date$year))
dat ## voilà
#    Speed (mph)          Driver                        Car    Engine       Date
# 1      407.447 Craig Breedlove          Spirit of America    GE J47 1963-08-05
# 2      413.199       Tom Green           Wingfoot Express    WE J46 1964-10-02
# 3      434.220      Art Arfons              Green Monster    GE J79 1964-10-05
# 4      468.719 Craig Breedlove          Spirit of America    GE J79 1964-10-13
# 5      526.277 Craig Breedlove          Spirit of America    GE J79 1965-10-15
# 6      536.712      Art Arfons              Green Monster    GE J79 1965-10-27
# 7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-02
# 8      576.553      Art Arfons              Green Monster    GE J79 1965-11-07
# 9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-15
# 10     622.407   Gary Gabelich                 Blue Flame    Rocket 1970-10-23
# 11     633.468   Richard Noble                   Thrust 2 RR RG 146 1983-10-04
# 12     763.035      Andy Green                 Thrust SSC   RR Spey 1997-10-15
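A note on the date step, since it may look odd: with two-digit years, `%y` maps 00 to 68 onto 20xx, so "63" is first read as 2063. A POSIXlt `year` field counts years since 1900, so any value above 120 is past 2020 and gets moved back a century. Here is a minimal sketch of just that step on two made-up input strings (the vector `x` and the fixed `tz = "UTC"` are my assumptions, not part of the report data):

```r
# Two illustrative date strings in the report's m/d/yy layout
x <- c("8/5/63", "10/15/97")

# %y reads "63" as 2063 and "97" as 1997
d <- as.POSIXlt(x, format = "%m/%d/%y", tz = "UTC")

# year is stored as years since 1900, so 2063 is 163;
# anything above 120 (i.e. after 2020) is shifted back 100 years
d$year <- ifelse(d$year > 120, d$year - 100, d$year)

fixed <- format(d, "%Y-%m-%d")
print(fixed)
```

The threshold of 120 is specific to this example; pick a cutoff that fits the date range of your own reports.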
Hope it works for you.
Limitations: Of course, the table in this example is quite simple; with messier tables you may have to clean up the result with gsub and similar functions.
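To give an idea of what that cleanup can look like, here is a rough sketch in base R. The matrix `m` below is made up to mimic the kind of character matrix extract_tables() returns (stray whitespace, a footnote marker, a thousands separator), and the helper `clean` is just a name I chose for illustration. Your own tables will likely need different patterns.

```r
# Hypothetical messy character matrix; first row holds the column names
m <- matrix(c("Speed (mph)", "Driver",
              " 1,407.4 ",   "Craig  Breedlove*",
              "413.199",     " Tom Green"),
            ncol = 2, byrow = TRUE)

# Illustrative helper: strip footnote markers and tidy whitespace
clean <- function(x) {
  x <- gsub("\\*", "", x)    # drop footnote asterisks
  x <- gsub("\\s+", " ", x)  # collapse runs of whitespace
  trimws(x)                  # remove leading/trailing blanks
}

body <- apply(m[-1, , drop = FALSE], 2, clean)
body[, 1] <- gsub(",", "", body[, 1])  # remove thousands separators

# Same pattern as in the answer: first row becomes the column names
dat <- setNames(type.convert(as.data.frame(body), as.is = TRUE), m[1, ])
str(dat)
```

Once the separators are removed, type.convert() can turn the speed column into numbers instead of leaving it as text.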