Question
I have a folder with more than 2,000 RTF documents. I want to import them into R, preferably into a data frame that can be used together with the tidytext package. I also need an additional column holding the filename, so that I can link the content of each RTF document to the file it came from (later, I will also have to extract information from the filename and save it into separate columns of my data set).
I came across a solution by Jens Leerssen that I tried to adapt to my requirements:
require(textreadr)
read_plus <- function(flnm) {
  read_rtf(flnm) %>%
    mutate(filename = flnm)
}
tbl_with_sources <-
  list.files(path = "./data", pattern = "*.rtf",
             full.names = TRUE) %>%
  map_df(~read_plus(.))
However, I get the following error message:
Can anyone tell me why this error occurs, or propose another solution to my problem?
Recommended answer
I finally solved the problem with a workaround.
1) I converted the *.rtf files to *.txt files by using the textutil command in the macOS terminal:
find . -name \*.rtf -print0 | xargs -0 textutil -convert txt
By doing so, I also got rid of the formatting.
2) I then used Jens Leerssen's read_plus function. However, I now use read.delim instead of read_rtf, and I included two options (stringsAsFactors and quote) to get rid of warnings and/or errors:
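library(dplyr)  # provides %>%, mutate() and rename()
library(purrr)  # provides map_df()

# Read one text file (one row per line) and record which file it came from
read_plus <- function(flnm) {
  read.delim(flnm, header = FALSE, stringsAsFactors = FALSE, quote = "") %>%
    mutate(filename = flnm)
}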
3) Finally, I read in all the *.txt files and renamed the column V1 at the end.
# pattern is a regular expression: "\\.txt$" matches names ending in ".txt"
df <- list.files(path = "./data", pattern = "\\.txt$",
                 full.names = TRUE) %>%
  map_df(~read_plus(.)) %>%
  rename(paragraph = V1)
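The question also mentions extracting information from the filename into separate columns later on. The naming scheme is not given, so the following is only a rough sketch in base R. It assumes hypothetical file names of the form source_date_id.txt, and the toy data frame docs and the column names source, date and doc_id are made up for illustration:

```r
# Toy stand-in for the data frame built in step 3.
# The file names below are hypothetical; adapt the split to your own scheme.
docs <- data.frame(
  paragraph = c("Some text.", "More text.", "Other text."),
  filename  = c("./data/times_2019-03-01_001.txt",
                "./data/times_2019-03-01_001.txt",
                "./data/herald_2018-11-15_042.txt"),
  stringsAsFactors = FALSE
)

# Drop the directory and the extension, e.g. "times_2019-03-01_001"
stem <- tools::file_path_sans_ext(basename(docs$filename))

# Split each stem on "_" and bind the pieces into a character matrix.
# This assumes every file name has the same number of pieces.
parts <- do.call(rbind, strsplit(stem, "_", fixed = TRUE))

docs$source <- parts[, 1]
docs$date   <- as.Date(parts[, 2])  # assumes ISO dates (YYYY-MM-DD)
docs$doc_id <- parts[, 3]
```

If the file names do not all split into the same number of pieces, the rbind step will not line up and a regular expression per field is probably the safer route. With the tidyverse already loaded, tidyr::separate() can do the same kind of split.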