Problem description
I am using R to extract sentences containing specific person names from texts. Here is a sample paragraph:
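Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present as a spectator at the disputation of Leipzig (1519), but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.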
This short paragraph contains several person names, such as Johann Reuchlin, Melanchthon and Johann Eck. With the help of the openNLP package, three person names (Martin Luther, Paul and Melanchthon) are correctly extracted and recognized. I have two questions:
- How can I extract the sentences containing these names?
- Since the output of the named entity recognizer is not very promising, suppose I wrap each name in "[[ ]]", such as [[Johann Reuchlin]] and [[Melanchthon]]. How can I extract the sentences containing these name expressions [[A]], [[B]] ...?
Recommended answer
This uses `strsplit` and `grep`. First I made an object `para` containing your paragraph.
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Or a bit cleaner:
sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
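Splitting on every period is simple, but it also breaks on abbreviations and decimals, and a name containing regex metacharacters would be read as a pattern. The following is only a minimal sketch of one way to soften both issues, assuming sentences end in `.`, `!` or `?` followed by whitespace; `extract_sentences` is a hypothetical helper name, and `para` here is a shortened version of the sample paragraph.

```r
# Hypothetical helper: return the sentences of `text` that mention any of `names`.
extract_sentences <- function(text, names) {
  # Split on whitespace that follows ., ! or ? (lookbehind requires perl = TRUE),
  # so the punctuation stays attached to its sentence.
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  # fixed = TRUE matches each name literally instead of as a regular expression.
  hits <- lapply(names, function(n) grepl(n, sentences, fixed = TRUE))
  keep <- Reduce(`|`, hits)
  sentences[keep]
}

# Shortened sample paragraph (assumption: three of the original sentences).
para <- paste(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)

res <- extract_sentences(para, c("Paul", "Johann Eck"))
print(res)
```

This still treats any period followed by a space as a sentence boundary, so text with abbreviations such as "Dr. Eck" would need a proper sentence tokenizer.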
If you want the sentences for each person returned separately, then:
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)
[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Edit 3: To add each person's name to the result, do something simple such as:
foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
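As an alternative to prepending the name to each result vector, the list elements themselves can be named, which keeps the names separate from the sentences. This is a small sketch of that idea rather than part of the original answer; `by_person` is a hypothetical variable name, and `sentences` is redefined here with three of the sample sentences so the snippet runs on its own.

```r
# Three of the sample sentences, already split (assumption: shortened input).
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)
toMatch <- c("Paul", "Melanchthon")

# One list element per name, labelled with that name.
# fixed = TRUE treats each name as a literal string rather than a regex.
by_person <- setNames(
  lapply(toMatch, function(m) sentences[grepl(m, sentences, fixed = TRUE)]),
  toMatch
)

print(by_person[["Melanchthon"]])
```

With a named list, a person's sentences can be looked up directly by name, for example `by_person[["Paul"]]`.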
Edit 4: If you want to find sentences that contain several people/places/things (words) at once, add a combined pattern for them, such as:
toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")
and change `perl` to `TRUE`:
foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = T)])}
> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
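The combined pattern works because each `(?=.*word)` is a lookahead that must succeed from the start of the sentence, so the sentence matches only when every word occurs somewhere in it. Writing such patterns by hand gets tedious, so here is a minimal sketch of building them from a vector of words; `all_of_pattern` is a hypothetical helper name, and the words are still interpreted as regular expressions.

```r
# Hypothetical helper: build a pattern that matches only if every word occurs.
all_of_pattern <- function(words) {
  # One lookahead per word, concatenated into a single pattern.
  paste0("(?=.*", words, ")", collapse = "")
}

# Three of the sample sentences, already split (assumption: shortened input).
sentences <- c(
  " Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  " He studied the Scripture, especially of Paul, and Evangelical doctrine",
  " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)

pat <- all_of_pattern(c("Melanchthon", "Scripture"))
# Lookaheads are a PCRE feature, so perl = TRUE is required.
both <- sentences[grepl(pat, sentences, perl = TRUE)]
print(pat)
print(both)
```

Only the last sentence mentions both Melanchthon and Scripture, so it is the only one returned.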
Edit 5: To answer your other question:
Given:
sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
will give you the words inside the double brackets.
> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen" "Wittenberg" "Martin Luther" "Johann Reuchlin"
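The same pattern can also be used to go from the bracketed names back to the sentences that contain them, which is what the second question asks for. The following is a rough sketch under the assumption that the text has already been split into sentences and that names were wrapped in `[[ ]]` by hand; `tagged`, `tagged_sentences` and `by_name` are hypothetical names, and the sentences are shortened versions of the sample.

```r
# Assumed input: sentences with hand-added [[ ]] markup (shortened sample text).
tagged <- c(
  "He accepted a call to the University of [[Wittenberg]] by [[Martin Luther]]",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "[[Melanchthon]] became professor of the Greek language in [[Wittenberg]] at the age of 21"
)

tag_pattern <- "\\[\\[.*?\\]\\]"

# Keep only the sentences that contain at least one [[...]] expression.
tagged_sentences <- tagged[grepl(tag_pattern, tagged, perl = TRUE)]

# Extract the bracketed names from each kept sentence and strip the brackets.
tags <- regmatches(tagged_sentences,
                   gregexpr(tag_pattern, tagged_sentences, perl = TRUE))
tags <- lapply(tags, function(x) gsub("\\[\\[|\\]\\]", "", x))

# Group the sentences by name: each sentence is repeated once per name it contains.
by_name <- split(rep(tagged_sentences, lengths(tags)), unlist(tags))

print(names(by_name))
print(by_name[["Wittenberg"]])
```

Here `by_name` is a list with one element per bracketed name, each holding every sentence in which that name was marked up.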