Problem description
I have a tm Corpus object like this:
> summary(corp.eng)
A corpus with 154 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
The metadata for each document in the corpus looks like this:
> meta(corp.eng[[1]])
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-18 14:37:24
Description :
Heading :
ID : Smith-John_e.txt
Language : en_CA
Origin :
I know that I can set the Author of one document at a time with this:
meta(corp.eng[[1]], tag = "Author") <-
  paste(
    rev(
      unlist(
        strsplit(meta(corp.eng[[1]], tag = "ID"), "[-_]")
      )[1:2]
    ), collapse = " ")
This gives me the following result:
> meta(corp.eng[[1]],tag="Author")
[1] "John Smith"
How can I do this as a batch job for every document in the corpus?
Recommended answer
NOTE: This should probably still be a comment, but part of it works, so here is an example:
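library(tm)
data(crude)  # sample corpus shipped with tm
# Extract the "Places" tag from every document at once
extracted.values <- meta(crude, tag = "Places", type = "local")
# Write a transformed value back, one document at a time
for (i in seq_along(extracted.values)) {
  meta(crude[[i]], tag = "Places") <- substr(extracted.values[[i]], 1, 3)
}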
It should also be possible to do this with lapply, but since I am not familiar with the inner workings of tm, I'll stick with a loop. Replace substr with the function you need, and adjust the tag on the left-hand side accordingly. Hope this helps.
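Applied to the question's case, the same loop pattern might look like the sketch below. The names author_from_id and set_authors are hypothetical helpers introduced only for illustration, and the tag names "ID" and "Author" are taken from the meta() output shown in the question. More recent tm releases use lowercase tag names ("id", "author"), so the defaults may need adjusting for your version.

```r
library(tm)

# Hypothetical helper: "Smith-John_e.txt" -> "John Smith".
# Splits the ID on "-" or "_", keeps the first two pieces
# (surname, given name) and swaps their order.
author_from_id <- function(id) {
  parts <- unlist(strsplit(id, "[-_]"))[1:2]
  paste(rev(parts), collapse = " ")
}

# Hypothetical wrapper: sets the author tag of every document
# from its ID tag and returns the modified corpus.
# Assumption: tag names as printed in the question; pass
# id_tag = "id", author_tag = "author" for newer tm versions.
set_authors <- function(corpus, id_tag = "ID", author_tag = "Author") {
  for (i in seq_along(corpus)) {
    doc_id <- meta(corpus[[i]], tag = id_tag)
    meta(corpus[[i]], tag = author_tag) <- author_from_id(doc_id)
  }
  corpus
}

# Usage with the corpus from the question:
# corp.eng <- set_authors(corp.eng)
```

Keeping the string parsing in its own function makes it easy to check on a few IDs before touching the corpus, and to swap in a different rule if some file names do not follow the Surname-Given_x.txt pattern.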