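If the text contains German umlauts [äöü], the results of preg_match_all have wrong offsets (each umlaut seems to shift the following offsets by 1).
I need the position of each word because the words will be replaced by other strings. In this tool, https://regex101.com/r/UosqVD/2, it works as expected: the matches have the correct start values.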
$pattern = "~\b\w+\b~u";
$text = "Käthe würde gerne wählen.";
if (preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE)) {
    foreach ($matches[0] as $m) {
        echo $m[0]."; ".$m[1]."; ".mb_strlen($m[0], "utf-8")."<br />";
    }
}
Output:
Text; Start; Length
Käthe; 0; 5
würde; 7; 5
gerne; 14; 5
wählen; 20; 6
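Best answer
preg_match_all() reports offsets in bytes, even with the u modifier, and each umlaut occupies two bytes in UTF-8, which is why every umlaut shifts the following offsets by one. The PHP documentation includes a user-contributed function, mb_preg_match_all(), that appears to do what you need: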
function mb_preg_match_all($ps_pattern, $ps_subject, &$pa_matches, $pn_flags = PREG_PATTERN_ORDER, $pn_offset = 0, $ps_encoding = NULL) {
    // WARNING! - All this function does is to correct offsets, nothing else:
    if (is_null($ps_encoding)) {
        $ps_encoding = mb_internal_encoding();
    }
    // Convert the character offset passed in to the byte offset preg_match_all() expects.
    $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
    $ret = preg_match_all($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);
    if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE)) {
        // Convert each reported byte offset back to a character offset.
        foreach ($pa_matches as &$ha_group) {
            foreach ($ha_group as &$ha_match) {
                if ($ha_match[1] >= 0) { // -1 marks an unmatched subpattern
                    $ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
                }
            }
            unset($ha_match); // break the reference before the next group
        }
        unset($ha_group);
    }
    // (code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)
    return $ret;
}
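For the word list in the question, the same byte-to-character conversion can be sketched in a few lines. This is only an illustration under some assumptions: the source file and the subject string are UTF-8, and the helper names word_char_offsets() and replace_words() are made up for this example and are not part of PHP. Since the goal was to replace the words, the second helper shows that substr_replace() itself works on byte offsets, so the raw offsets from preg_match_all() can be used there without any conversion.

```php
<?php
// Hypothetical helper (not part of PHP): returns [word, character offset,
// character length] for each word instead of the byte offset.
function word_char_offsets(string $text, string $encoding = 'UTF-8'): array
{
    $words = [];
    if (preg_match_all('~\b\w+\b~u', $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$word, $byteOffset]) {
            // Count the characters in the bytes that precede the match.
            $charOffset = mb_strlen(substr($text, 0, $byteOffset), $encoding);
            $words[] = [$word, $charOffset, mb_strlen($word, $encoding)];
        }
    }
    return $words;
}

// Hypothetical helper: replaces every word with the result of $replace($word).
// substr_replace() takes byte offsets, so the raw offsets are used as-is.
function replace_words(string $text, callable $replace): string
{
    if (!preg_match_all('~\b\w+\b~u', $text, $matches, PREG_OFFSET_CAPTURE)) {
        return $text;
    }
    // Work from the last match to the first so earlier offsets stay valid.
    foreach (array_reverse($matches[0]) as [$word, $byteOffset]) {
        $text = substr_replace($text, $replace($word), $byteOffset, strlen($word));
    }
    return $text;
}

$text = "Käthe würde gerne wählen.";
foreach (word_char_offsets($text) as [$word, $start, $length]) {
    echo "$word; $start; $length\n";
}
echo replace_words($text, function ($w) {
    return mb_strtoupper($w, 'UTF-8');
}), "\n";
```

With the sample sentence this should report the start values 0, 6, 12 and 18, matching what regex101 shows. If the character offsets are only needed for a replacement, working with byte offsets throughout (or using preg_replace_callback()) avoids the conversion entirely.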
Regarding "php - How to preg_match_all on German umlauts [äöü]?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56091088/