Problem description
I want to split a text into sentences on the full stop (.).
For example:
Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.
If I split the above text, I get 3 sentences:
1. Mr
2. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.
3. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.
I want "Mr." to stay attached to the sentence it begins, so that the text splits into two sentences, not three:
1. Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.
2. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.
Kindly help me. I appreciate the instant feedback from the community.
Thanks.
Recommended answer
If you are looking for a way to avoid splitting sentences after any abbreviation (like a.m.), that's a difficult natural language problem.
If you just want to split sentences without tripping over Mr. or Mrs. (and you have a character that is unlikely to show up in the text, like *), here's a simple way:
- Replace all instances of Mr. and Mrs. with Mr* and Mrs*.
- Split the text on the period (.).
- In the resulting array, replace all instances of Mr* and Mrs* with Mr. and Mrs.
Here's a version that uses NUL as the sentinel character, since it's pretty much impossible for it to show up in text unintentionally:
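using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<string> Splitter(string sentences)
{
    // NUL stands in for the period of "Mr." and "Mrs." while splitting.
    char sentinel = '\0';
    return sentences.Replace("Mr.", "Mr" + sentinel)
                    .Replace("Mrs.", "Mrs" + sentinel)
                    // Split on a period followed by a space.
                    .Split(new[] { ". " }, StringSplitOptions.None)
                    // Restore the protected periods in each piece.
                    .Select(s => s.Replace("Mr" + sentinel, "Mr.")
                                  .Replace("Mrs" + sentinel, "Mrs."));
}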
If you're the paranoid sort of person who thinks any particular character is liable to show up in your text, feel free to use a GUID for the sentinel.
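The GUID idea could be sketched roughly as follows. This is only an illustration, not part of the original answer: the class name SentenceSplitter, the method name SplitSentences, and the caller-supplied abbreviation list are all assumptions made for the example. It still splits on a period followed by a space, so a sentence that ends in a protected abbreviation will not be split.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SentenceSplitter
{
    // Hypothetical helper (not from the original answer): hides the trailing
    // period of each listed abbreviation behind a GUID sentinel, splits on
    // ". ", then restores the hidden periods.
    public static List<string> SplitSentences(string text, params string[] abbreviations)
    {
        // The "N" format yields 32 hex digits, so the sentinel contains no period.
        string sentinel = Guid.NewGuid().ToString("N");

        // Example: "Mr." becomes "Mr" followed by the sentinel.
        foreach (string abbr in abbreviations)
            text = text.Replace(abbr, abbr.TrimEnd('.') + sentinel);

        return text.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries)
                   .Select(s => s.Replace(sentinel, "."))
                   .ToList();
    }
}
```

Calling SentenceSplitter.SplitSentences(text, "Mr.", "Mrs.") on the example text from the question should return two items, with "Mr. Bean" kept intact in the first. As with the NUL version, the period that ended each sentence is consumed by the split, except for the last one.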