php - 获取最常用的带有特殊字符的单词

我想从数组中得到最常用的单词。唯一的问题是瑞典字符（_、_和_）只显示为。

$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';
echo '<pre>';
print_r(array_count_values(str_word_count($string, 1, 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ')));
echo '</pre>';

该代码将输出以下内容：

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [�] => 1
    [�] => 1
    [and] => 2
    [�] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [�] => 1
    [�] => 1
    [�] => 1
)

我怎样才能“看到”瑞典字符和其他特殊字符？

最佳答案

下面是一个使用正则表达式的解决方案，它使用Unicode标点来拆分“单词”，然后只使用常规数组出现次数。

array_count_values(preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY));

生产：

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

这是在Unicode控制台中测试的，如果您使用浏览器，您可能需要配置一个编码。在浏览器中创建<meta>标记或设置编码，或者发送php头。