本文介绍了使用正则表达式将字符串拆分为句子的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在$sentences中存储了随机文本.我想使用正则表达式将文本拆分为句子,请参见:

I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

function splitSentences($text) {
    $re = '/                # Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | T\.V\.A\.         # or "T.V.A.",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';

    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);

print_r($sentences);

它工作正常.

但是,如果有Unicode字符,它不会分成句子:

However, it doesn't split into sentences if there are unicode characters:

$sentences = 'Entertainment media properties. Fairy Tail and Tokyo Ghoul.';

或者这种情况:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.";

当文本中存在Unicode字符时,我该怎么做才能使其正常工作?

What can I do to make it work when unicode characters exist in the text?

这是用于测试的 ideone .

我正在寻找一个完整的解决方案.在发布答案之前,请阅读我与WiktorStribiżew在一起的评论主题,以获取有关此问题的更多相关信息.

I am looking for a complete solution to this. Before posting an answer, please read the comment thread I had with WiktorStribiżew for more relevant info on this issue.

推荐答案

正如应该预期的那样,任何自然语言处理都不是一件容易的事.其原因是它们是进化系统.没有一个人坐下来思考哪些是好主意,哪些不是.每个规则都有20-40%的例外.话虽这么说,但可以进行出价的单个正则表达式的复杂性已超出图表.尽管如此,以下解决方案仍主要依靠正则表达式.

As it should be expected, any sort of natural language processing is not a trivial task. The reason for it is that they are evolutionary systems. There is no single person who sat down and thought about which are good ideas and which - not. Every rule has 20-40% exceptions. With that said the complexity of a single regex that can do your bidding would be off the charts. Still, the following solution relies mainly on regexes.

  • 想法是逐步遍历文本.
  • 在任何给定时间,文本的当前块将包含在两个不同的部分中.一个是候选字符串之前的候选字符串,另一个是子字符串之后的候选字符串.
  • 前10个正则表达式对检测看起来像句子边界但实际上不是的位置.在这种情况下,之前之后无需输入新句子即可前进.
  • 如果这些对均不匹配,将尝试与最后3对进行匹配,可能会检测到边界.
  • The idea is to gradually go over the text.
  • At any given time, the current chunk of the text will be contained in two different parts. One, which is the candidate for a substring before a sentence boundary and another - after.
  • The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
  • If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.

这些的来源是什么?-我翻译了此Ruby库,它是基于本文.如果您真的想了解它们,除了阅读本文之外别无选择.

As for where did these regexes come from? - I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

就准确性而言,我建议您使用其他文字进行测试.经过一番实验,我感到非常惊喜.

As far as accuracy goes - I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

就性能而言-正则表达式应具有很高的性能,因为它们都具有\A\Z锚,几乎没有重复量词,并且在某些地方-没有任何重复量回溯. regexes仍然是regexes.如果您打算在大量文本上使用紧密循环,则必须进行一些基准测试.

In terms of performance - the regexes should be highly performant as all of them have either a \A or \Z anchor, there are almost no repetition quantifiers, and in the places there are - there can't be any backtracking. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this is tight loops on huge chunks of text.

强制性免责声明:请原谅我生锈的php技能.以下代码可能不是有史以来最惯用的php,但仍然应该足够清楚,以使您明白这一点.

Mandatory disclaimer: excuse my rusty php skills. The following code might not be the most idiomatic php ever, it should still be clear enough to get the point across.

function sentence_split($text) {
    $before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\""]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\""\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));

这篇关于使用正则表达式将字符串拆分为句子的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-15 01:21