Where can I find an algorithm that judges misspellings with misplaced characters more accurately than PHP's levenshtein() and similar_text() functions?

Example:

similar_text('jonas', 'xxjon', $similar); echo $similar; // returns 60
similar_text('jonas', 'asjon', $similar); echo $similar; // returns 60 <- although more similar!
echo levenshtein('jonas', 'xxjon'); // returns 4
echo levenshtein('jonas', 'asjon'); // returns 4  <- although more similar!

/Jonas

Best answer

Here is the solution I came up with. It is based on Tim's suggestion to compare the order of consecutive characters. Some results:

  • jonas/jonax:0.8
  • jonas/sjona:0.68
  • jonas/sjonas:0.66
  • jonas/asjon:0.52
  • jonas/xxjon:0.36
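
To see where a score such as 0.52 for jonas/asjon comes from, here is a rough sketch of the arithmetic, read off the function shown further down: each matched segment contributes a position factor times a length factor, and the contributions are summed. The helper name `segment_score` is made up for illustration and is not part of the answer's code.

```php
<?php
// Hypothetical helper: score of one matched segment, following the formula
// used in string_compare() (position factor times length factor).
function segment_score(int $segLen, int $posA, int $posB, int $lenA, int $lenB): float
{
    $posFactor = ($lenA - abs($posA - $posB)) / $lenB;
    $lengthFactor = $segLen / $lenA;
    return $posFactor * $lengthFactor;
}

// jonas vs asjon:
//   "jon" starts at 0 in jonas and at 2 in asjon
//   "as"  starts at 3 in jonas and at 0 in asjon
$score = segment_score(3, 0, 2, 5, 5)   // 0.6 * 0.6 = 0.36
       + segment_score(2, 3, 0, 5, 5);  // 0.4 * 0.4 = 0.16
// $score is approximately 0.52, matching the list above
```

For jonas/xxjon only the "jon" segment matches, which leaves 0.36.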

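I'm sure it's not perfect and could be optimized, but it nevertheless seems to produce the results I'm after...

One weakness is that when the strings have different lengths, swapping the two arguments gives a different result...
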
    static public function string_compare($str_a, $str_b)
    {
        $length = strlen($str_a);
        $length_b = strlen($str_b);
    
        $i = 0;
        $segmentcount = 0;
        $segmentsinfo = array();
        $segment = '';
        while ($i < $length)
        {
            $char = substr($str_a, $i, 1);
            if (strpos($str_b, $char) !== FALSE)
            {
                $segment = $segment.$char;
                if (strpos($str_b, $segment) !== FALSE)
                {
                    $segmentpos_a = $i - strlen($segment) + 1;
                    $segmentpos_b = strpos($str_b, $segment);
                    $positiondiff = abs($segmentpos_a - $segmentpos_b);
                    $posfactor = ($length - $positiondiff) / $length_b; // <-- ?
                    $lengthfactor = strlen($segment)/$length;
                    $segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
                }
                else
                {
                    $segment = '';
                    $i--;
                    $segmentcount++;
                }
            }
            else
            {
                $segment = '';
                $segmentcount++;
            }
            $i++;
        }
    
        // PHP 5.3 lambda in array_map
        $totalscore = array_sum(array_map(function($v) { return $v['score'];  }, $segmentsinfo));
        return $totalscore;
    }
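
The weakness mentioned above (swapping the arguments changes the score when the lengths differ) could be smoothed over by scoring in both directions and averaging. What follows is only a sketch under assumptions: `string_compare_symmetric` is a made-up name, the standalone `string_compare` is a condensed restatement of the method above so that the snippet runs by itself, and averaging is just one possible choice (taking the minimum or the maximum would be others). The type declarations assume PHP 7 or later.

```php
<?php
// Condensed standalone restatement of the scoring logic above.
function string_compare(string $a, string $b): float
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    $scores = [];   // one entry per matched segment
    $count = 0;     // index of the current segment
    $segment = '';
    for ($i = 0; $i < $lenA; $i++) {
        $candidate = $segment . $a[$i];
        $posB = strpos($b, $candidate);
        if ($posB !== false) {
            // The segment still occurs in $b: extend it and overwrite its score.
            $segment = $candidate;
            $posA = $i - strlen($segment) + 1;
            $posFactor = ($lenA - abs($posA - $posB)) / $lenB;
            $scores[$count] = $posFactor * strlen($segment) / $lenA;
        } else {
            // The segment broke. If the character itself occurs in $b,
            // step back so it is retried as the start of a new segment.
            if (strpos($b, $a[$i]) !== false) {
                $i--;
            }
            $segment = '';
            $count++;
        }
    }
    return (float) array_sum($scores);
}

// Hypothetical wrapper: average both directions so argument order no longer matters.
function string_compare_symmetric(string $a, string $b): float
{
    return (string_compare($a, $b) + string_compare($b, $a)) / 2;
}
```

With this wrapper, `string_compare_symmetric('jonas', 'sjonas')` and `string_compare_symmetric('sjonas', 'jonas')` return the same value, whereas the two plain `string_compare` calls differ. Note that `strlen` and `strpos` work on bytes, so multibyte input would need the `mb_*` equivalents.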
    

    09-16 11:49