I am using R to extract sentences that contain specific person names from a text. Here is a sample paragraph:


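Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.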


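This short paragraph contains several person names, for example:
Johann Reuchlin, Melanchthon, Johann Eck. With the openNLP package, three person names are correctly extracted and identified: Martin Luther, Paul and Melanchthon. I then have two questions: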


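How can I extract the sentences that contain these names?
Since the output of the named entity recognizer is not very good, suppose I mark each name by wrapping it in "[[ ]]" (e.g. [[Johann Reuchlin]], [[Melanchthon]]). How can I then extract the sentences that contain these name expressions [[A]], [[B]] ...?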

Best answer

Using `strsplit` and `grep`: first I made an object `para` that holds your paragraph.
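For reference, a minimal sketch of how `para` might be defined so that the snippets in this answer can be run as written. Storing the paragraph as a single character string is an assumption about the setup; four of the sentences are taken from the output shown in this answer, while the English wording of the Leipzig sentence is an assumption.

```r
# Assumption: the whole paragraph is held as one length-1 character vector.
para <- paste(
  "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)

# Splitting on a literal period gives one element per sentence.
sentences <- unlist(strsplit(para, split = "\\."))
length(sentences)
```

Note that splitting on `"\\."` drops the period itself and leaves a leading space on every sentence after the first, which is why the output in this answer looks the way it does.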

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]


> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"


Or, a bit cleaner:

sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
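Splitting on `"\\."` alone removes the final period and leaves a leading space on each sentence. One possible variation, sketched here on a shortened copy of the paragraph, splits on the whitespace that follows a period instead. The lookbehind pattern and the use of `grepl` are my own substitutions rather than part of the original answer, and abbreviations such as "Dr." would still be split incorrectly.

```r
# Shortened sample text, for illustration only.
para <- "Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator."
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Split on whitespace preceded by a period; the lookbehind keeps the period
# attached to its sentence. perl = TRUE is required for (?<=...).
sentences <- unlist(strsplit(para, "(?<=\\.)\\s+", perl = TRUE))

# grepl() returns a logical vector, which reads naturally as a filter.
hits <- sentences[grepl(paste(toMatch, collapse = "|"), sentences)]
hits
```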


If you want the sentences for each person returned separately:

toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)

[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"


Edit 3: To include each person's name with their sentences, do something simple like this:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
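As an alternative to prepending the name to each result, the list itself can be named by the search terms. This is a sketch using base R's `sapply` with `simplify = FALSE`; the shortened `sentences` vector is made up for illustration.

```r
# Abbreviated sentences, for illustration only.
sentences <- c("he accepted a call by Martin Luther",
               " Melanchthon became professor of the Greek language",
               " He studied the Scripture, especially of Paul",
               " Melanchthon replied based on the authority of Scripture")
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

foo <- function(Match) sentences[grep(Match, sentences)]

# For a character input and simplify = FALSE, sapply() returns a list whose
# names are the search terms themselves.
byName <- sapply(toMatch, foo, simplify = FALSE)
byName[["Melanchthon"]]
```

A named list keeps the sentences and the label separate, so `byName[["Paul"]]` returns only sentences and never the search term.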


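Edit 4:

And if you want to find sentences that contain more than one person/place/thing (word), add a pattern to `toMatch` that requires both of them via lookaheads, for example: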

toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")


and set `perl` to `TRUE`, since lookaheads require Perl-compatible regular expressions:

foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = TRUE)])}


> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
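The lookahead pattern in Edit 4 can also be expressed without regular-expression syntax, by testing each term separately and combining the results. The helper below, `containsAll`, is a hypothetical name introduced for this sketch and is not part of the original answer.

```r
sentences <- c(" Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
               " He studied the Scripture, especially of Paul, and Evangelical doctrine",
               " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium")

# Hypothetical helper: keep the elements of x that contain every term.
containsAll <- function(terms, x) {
  # One logical vector per term (literal matching), combined with AND.
  hits <- lapply(terms, function(term) grepl(term, x, fixed = TRUE))
  x[Reduce(`&`, hits)]
}

both <- containsAll(c("Melanchthon", "Scripture"), sentences)
both
```

Because `fixed = TRUE` matches the terms literally, names that contain regex metacharacters need no escaping here.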


Edit 5: To answer your other question:

Given:

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])


will give you the words inside the double brackets.

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
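To tie this back to the second question, the same pattern can be used to select the sentences that contain a bracketed name, not just to pull the names out. The following is a sketch on a made-up, shortened tagged paragraph; the variable names are illustrative.

```r
# Shortened, hand-tagged text, for illustration only.
tagged <- "he accepted a call by [[Martin Luther]], recommended by [[Johann Reuchlin]]. [[Melanchthon]] became professor of Greek. He studied the Scripture."
sentences <- unlist(strsplit(tagged, split = "\\."))

# Keep only the sentences that contain at least one [[...]] tag.
hasTag <- grepl("\\[\\[.*?\\]\\]", sentences, perl = TRUE)
taggedSentences <- sentences[hasTag]

# For each kept sentence, extract the tagged names and strip the brackets.
matches <- regmatches(taggedSentences,
                      gregexpr("\\[\\[.*?\\]\\]", taggedSentences, perl = TRUE))
namesPerSentence <- lapply(matches, function(m) gsub("\\[\\[|\\]\\]", "", m))

# To find the sentences for one specific tagged name, match it literally.
melanchthon <- sentences[grepl("[[Melanchthon]]", sentences, fixed = TRUE)]
```

Here `taggedSentences` and `namesPerSentence` line up element by element, so each sentence can be paired with the names found in it.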

07-24 17:02