Question
I am currently matching HTML using this code:
preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position)
It matches everything perfectly; however, if the string contains a multibyte character, that character is counted as 2 characters in the position that comes back.
For example, the returned $match array would give something like:
array
  0 =>
    array
      0 => string '<br />' (length=6)
      1 => int 132
  1 =>
    array
      0 => string 'br' (length=2)
      1 => int 133
The real position of the <br /> match is 128, but there are 4 multibyte characters before it, so it's giving 132. I really thought adding the /u modifier would make it aware of what's going on, but no luck there.
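A note on why this happens: with PREG_OFFSET_CAPTURE, preg_match reports offsets in bytes, and the /u modifier changes how the pattern and subject are interpreted, not how offsets are counted. One possible workaround, sketched below under the assumption that the input is valid UTF-8, is to convert the byte offset to a character offset after matching. The helper name byte_to_char_offset and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: convert a byte offset (as returned with
// PREG_OFFSET_CAPTURE) into a character offset, assuming valid UTF-8.
function byte_to_char_offset(string $subject, int $byteOffset): int
{
    // Count the characters contained in the bytes that precede the match.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// Sample input (an assumption): two 2-byte characters precede the tag.
$html = "caf\u{00E9} na\u{00EF}ve<br />";

preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE);

$byteOffset = $match[0][1];                            // counted in bytes
$charOffset = byte_to_char_offset($html, $byteOffset); // counted in characters
```

If the character offset is later fed back into preg_match as the starting $position, it would need to be converted back to bytes first, since that argument is also byte-based.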
Recommended answer
I looked at this suggestion from @Qtax:
For some more reference, this bug surfaced while using this: Truncate text containing HTML, ignoring tags
The gist of the changes is:
$orig_utf = 'UTF-8';
$new_utf  = 'UTF-32';

// UTF-32 uses a fixed 4 bytes per character, so byte offsets divide cleanly by 4.
mb_regex_encoding( $new_utf );
$html     = mb_convert_encoding( $html, $new_utf, $orig_utf );
$end_char = mb_convert_encoding( $end_char, $new_utf, $orig_utf );

// The subject string is registered here; the search calls below reuse it.
mb_ereg_search_init( $html );
$pattern = '</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;';
$pattern = mb_convert_encoding( $pattern, $new_utf, $orig_utf );

// The second parameter of mb_ereg_search_pos() is an options string, not the subject, so only the pattern is passed.
while ( $printed < $limit && $tag_match = mb_ereg_search_pos( $pattern ) ) {
    // $tag_match holds [byte offset, byte length]; divide by 4 to get characters.
    $tag_position = $tag_match[0]/4;
    $tag_length   = $tag_match[1];
    $tag          = mb_substr( $html, $tag_position, $tag_length/4, $new_utf );
    $tag_name     = preg_replace( '/[\s<>\/]+/', '', $tag );
    // Print text leading up to the tag.
    $str = mb_substr( $html, $position, $tag_position - $position, $new_utf );
    .......
}
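The division by 4 relies on UTF-32 being a fixed-width encoding. A small sketch of that arithmetic, using a made-up sample string and a hard-coded byte offset rather than a real regex match:

```php
<?php
// Sample input (an assumption for illustration).
$utf8  = "caf\u{00E9} na\u{00EF}ve<br />";
$utf32 = mb_convert_encoding($utf8, 'UTF-32', 'UTF-8');

$charCount  = mb_strlen($utf8, 'UTF-8'); // number of characters
$utf8Bytes  = strlen($utf8);             // bytes, variable width per character
$utf32Bytes = strlen($utf32);            // bytes, always 4 per character

// A byte offset into the UTF-32 string maps to a character index via / 4.
$tagByteOffset = 40;                     // 10 characters precede the tag
$tagCharOffset = intdiv($tagByteOffset, 4);

$tag     = mb_substr($utf32, $tagCharOffset, 6, 'UTF-32');
$tagUtf8 = mb_convert_encoding($tag, 'UTF-8', 'UTF-32');
```

Here the UTF-8 byte count differs from the character count, while the UTF-32 byte count is exactly four times the character count, which is what makes the offsets easy to convert.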
Also, in reference to the truncate HTML page, there are other necessary changes:
$first_char = mb_substr( $tag, 0, 1, $new_utf );
if ( $first_char == mb_convert_encoding( '&', $new_utf ) ) {
    ...
}
My text editor saves files as UTF-8, so if I compared the UTF-32 character directly against the ampersand literal in my source file, it wouldn't work.
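To illustrate the comparison issue, here is a minimal sketch assuming the source file is saved as UTF-8 and the tag text has already been converted to UTF-32; the sample entity and the variable names $plain_match and $converted_match are hypothetical.

```php
<?php
// An entity as it would appear in the converted text (sample value).
$tag        = mb_convert_encoding('&amp;', 'UTF-32', 'UTF-8');
$first_char = mb_substr($tag, 0, 1, 'UTF-32');

// The literal '&' in a UTF-8 source file is 1 byte; $first_char is 4 bytes.
$plain_match = ($first_char === '&');

// Converting the literal to UTF-32 first makes the comparison meaningful.
$converted_match = ($first_char === mb_convert_encoding('&', 'UTF-32', 'UTF-8'));
```

The same applies to any other literal compared against the converted text, such as '<' or '/'.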