
Problem description

I am using R to extract sentences containing specific person names from texts. Here is a sample paragraph:

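Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.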

This short paragraph contains several person names, such as Johann Reuchlin, Melanchthon and Johann Eck. With the help of the openNLP package, three person names (Martin Luther, Paul and Melanchthon) are correctly extracted and recognized. I have two questions:

How could I extract the sentences containing these names?
Since the output of the named entity recognizer is not very reliable, suppose I mark up each name with "[[ ]]", such as [[Johann Reuchlin]], [[Melanchthon]]. How could I extract the sentences containing these name expressions [[A]], [[B]], ...?

Recommended answer

You can do this with `strsplit` and `grep`. First I made an object `para` holding your paragraph as a single string.

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]


> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                                    
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"                                                                               
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Or, a little cleaner:

sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
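Splitting on `\\.` drops the period and leaves a leading space on every sentence after the first, as the output above shows. A minimal sketch of one alternative follows, assuming sentences end in `.`, `!` or `?` followed by whitespace. `split_sentences` is a hypothetical helper name, and the sample uses only three sentences of the paragraph. Abbreviations such as "Dr. " would still be split incorrectly.

```r
# Hypothetical helper: split after ., ! or ? when followed by whitespace.
# The lookbehind keeps the punctuation; the whitespace itself is consumed.
split_sentences <- function(text) {
  unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
}

# Three sentences from the sample paragraph, joined with single spaces
para <- paste(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)
toMatch <- c("Paul", "Johann Eck")

sentences <- split_sentences(para)

# value = TRUE returns the matching sentences instead of their indices
hits <- grep(paste(toMatch, collapse = "|"), sentences, value = TRUE)
print(hits)
```

Using `value = TRUE` also avoids indexing `sentences` a second time.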

If you want the sentences for each person returned as separate results, then:

toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)

[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Edit 3: To include each person's name in the result, do something simple such as:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
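Prepending the name mixes the label with the data in a single character vector. As a sketch of an alternative (not the answerer's method), the names can label the list elements instead. The sample below assumes a shortened `sentences` vector taken from the paragraph, and uses `fixed = TRUE` on the assumption that each name should be matched literally rather than as a regular expression.

```r
# Shortened sample: three sentences from the paragraph, already split
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)
toMatch <- c("Paul", "Melanchthon")

# sapply() over a character vector uses its elements as names;
# simplify = FALSE keeps the result a list regardless of match counts.
by_name <- sapply(
  toMatch,
  function(name) grep(name, sentences, fixed = TRUE, value = TRUE),
  simplify = FALSE
)

# Look up one person's sentences by name
print(by_name[["Melanchthon"]])
```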

Edit 4: If you want to find sentences that contain several people/places/things (words) at once, add one more pattern that requires all of them, using one lookahead per word, such as:

toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")

Lookahead assertions are only supported by the PCRE engine, so set `perl = TRUE`:

foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = TRUE)])}


> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"                                                                                                                                         
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] "Paul"                                                                   
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] "Melanchthon"                                                                                                                          
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"                                                                                                     
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
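Lookahead patterns become hard to read as the number of required words grows. A minimal sketch of an alternative that needs no regular expression syntax: run one literal `grepl` per word and combine the logical vectors with `&`. `contains_all` is a hypothetical helper name and the `sentences` vector is a shortened sample.

```r
# Shortened sample: three sentences from the paragraph, already split
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)

# Hypothetical helper: keep sentences containing every word in `words`.
# Assumes `words` has at least one element.
contains_all <- function(words, sentences) {
  # One logical vector per word; fixed = TRUE matches the word literally
  hits <- lapply(words, grepl, x = sentences, fixed = TRUE)
  # Element-wise AND across all words
  sentences[Reduce(`&`, hits)]
}

both <- contains_all(c("Melanchthon", "Scripture"), sentences)
print(both)
```

Only the last sample sentence mentions both words, so it is the only one kept.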

Edit 5: To answer your other question:

Given:

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])

will give you the words inside the double brackets.

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
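The same pattern can also select whole sentences rather than just the bracketed words. The sketch below assumes a vector of already split sentences in which names have been marked up by hand; the sample sentences are shortened and the bracket placement is illustrative only.

```r
# Shortened, hand-tagged sample sentences (markup is illustrative)
sentences <- c(
  "He accepted a call to the University of [[Wittenberg]] by [[Martin Luther]]",
  "He studied the Scripture and Evangelical doctrine",
  "[[Johann Eck]] having attacked his views, [[Melanchthon]] replied"
)

# All [[...]] expressions per sentence (character(0) when there are none)
pattern <- "\\[\\[.*?\\]\\]"
found <- regmatches(sentences, gregexpr(pattern, sentences, perl = TRUE))

# Strip the brackets to get the bare names
names_in <- lapply(found, function(x) gsub("\\[\\[|\\]\\]", "", x))

# Sentences that contain at least one tagged name
tagged <- lengths(names_in) > 0
print(sentences[tagged])

# Sentences that contain one specific tagged name, matched literally
with_name <- sentences[grepl("[[Melanchthon]]", sentences, fixed = TRUE)]
print(with_name)
```

`fixed = TRUE` avoids having to escape the brackets when searching for a single tagged name.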
